Merge pull request #50419 from ClickHouse/reddit-fixes

Reddit dataset fixes
This commit is contained in:
Dan Roscigno 2023-06-01 10:30:56 -04:00 committed by GitHub
commit c70aa9592b
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

View File

@ -5,7 +5,7 @@ sidebar_label: Reddit comments
# Reddit comments dataset # Reddit comments dataset
This dataset contains publicly-available comments on Reddit that go back to December, 2005, to March, 2023, and contains over 7B rows of data. The raw data is in JSON format in compressed `.zst` files and the rows look like the following: This dataset contains publicly-available comments on Reddit that go back to December, 2005, to March, 2023, and contains over 14B rows of data. The raw data is in JSON format in compressed files and the rows look like the following:
```json ```json
{"controversiality":0,"body":"A look at Vietnam and Mexico exposes the myth of market liberalisation.","subreddit_id":"t5_6","link_id":"t3_17863","stickied":false,"subreddit":"reddit.com","score":2,"ups":2,"author_flair_css_class":null,"created_utc":1134365188,"author_flair_text":null,"author":"frjo","id":"c13","edited":false,"parent_id":"t3_17863","gilded":0,"distinguished":null,"retrieved_on":1473738411} {"controversiality":0,"body":"A look at Vietnam and Mexico exposes the myth of market liberalisation.","subreddit_id":"t5_6","link_id":"t3_17863","stickied":false,"subreddit":"reddit.com","score":2,"ups":2,"author_flair_css_class":null,"created_utc":1134365188,"author_flair_text":null,"author":"frjo","id":"c13","edited":false,"parent_id":"t3_17863","gilded":0,"distinguished":null,"retrieved_on":1473738411}
@ -18,7 +18,7 @@ This dataset contains publicly-available comments on Reddit that go back to Dece
A shoutout to Percona for the [motivation behind ingesting this dataset](https://www.percona.com/blog/big-data-set-reddit-comments-analyzing-clickhouse/), which we have downloaded and stored in an S3 bucket. A shoutout to Percona for the [motivation behind ingesting this dataset](https://www.percona.com/blog/big-data-set-reddit-comments-analyzing-clickhouse/), which we have downloaded and stored in an S3 bucket.
:::note :::note
The following commands were executed on ClickHouse Cloud. To run this on your own cluster, replace `default` in the `s3Cluster` function call with the name of your cluster. If you do not have a cluster, then replace the `s3Cluster` function with the `s3` function. The following commands were executed on a Production instance of ClickHouse Cloud with the minimum memory set to 720GB. To run this on your own cluster, replace `default` in the `s3Cluster` function call with the name of your cluster. If you do not have a cluster, then replace the `s3Cluster` function with the `s3` function.
::: :::
1. Let's create a table for the Reddit data: 1. Let's create a table for the Reddit data:
@ -75,18 +75,6 @@ The names of the files in S3 start with `RC_YYYY-MM` where `YYYY-MM` goes from `
2. We are going to start with one month of data, but if you want to simply insert every row - skip ahead to step 8 below. The following file has 86M records from December, 2017: 2. We are going to start with one month of data, but if you want to simply insert every row - skip ahead to step 8 below. The following file has 86M records from December, 2017:
```sql
INSERT INTO reddit
SELECT *
FROM s3Cluster(
'default',
'https://clickhouse-public-datasets.s3.eu-central-1.amazonaws.com/reddit/original/RC_2017-12.xz',
'JSONEachRow'
);
```
If you do not have a cluster, use `s3` instead of `s3Cluster`:
```sql ```sql
INSERT INTO reddit INSERT INTO reddit
SELECT * SELECT *
@ -94,6 +82,7 @@ INSERT INTO reddit
'https://clickhouse-public-datasets.s3.eu-central-1.amazonaws.com/reddit/original/RC_2017-12.xz', 'https://clickhouse-public-datasets.s3.eu-central-1.amazonaws.com/reddit/original/RC_2017-12.xz',
'JSONEachRow' 'JSONEachRow'
); );
``` ```
3. It will take a while depending on your resources, but when it's done verify it worked: 3. It will take a while depending on your resources, but when it's done verify it worked:
@ -198,26 +187,81 @@ LIMIT 10;
TRUNCATE TABLE reddit; TRUNCATE TABLE reddit;
``` ```
8. This is a fun dataset and it looks like we can find some great information, so let's go ahead and insert the entire dataset from 2005 to 2023. When you're ready, run this command to insert all the rows. (It takes a while - up to 17 hours!) 8. This is a fun dataset and it looks like we can find some great information, so let's go ahead and insert the entire dataset from 2005 to 2023. For practical reasons, it works well to insert the data by years starting with...
```sql
INSERT INTO reddit
SELECT *
FROM s3Cluster(
'default',
'https://clickhouse-public-datasets.s3.eu-central-1.amazonaws.com/reddit/original/RC_2005*',
'JSONEachRow'
)
SETTINGS zstd_window_log_max = 31;
```
...and ending with:
```sql ```sql
INSERT INTO reddit INSERT INTO reddit
SELECT * SELECT *
FROM s3Cluster( FROM s3Cluster(
'default', 'default',
'https://clickhouse-public-datasets.s3.amazonaws.com/reddit/original/RC*', 'https://clickhouse-public-datasets.s3.amazonaws.com/reddit/original/RC_2023*',
'JSONEachRow' 'JSONEachRow'
) )
SETTINGS zstd_window_log_max = 31; SETTINGS zstd_window_log_max = 31;
``` ```
The response looks like: If you do not have a cluster, use `s3` instead of `s3Cluster`:
```response ```sql
0 rows in set. Elapsed: 61187.839 sec. Processed 6.74 billion rows, 2.06 TB (110.17 thousand rows/s., 33.68 MB/s.) INSERT INTO reddit
SELECT *
FROM s3(
'https://clickhouse-public-datasets.s3.amazonaws.com/reddit/original/RC_2005*',
'JSONEachRow'
)
SETTINGS zstd_window_log_max = 31;
``` ```
8. Let's see how many rows were inserted and how much disk space the table is using: 8. To verify it worked, here are the number of rows per year (as of February, 2023):
```sql
SELECT
toYear(created_utc) AS year,
formatReadableQuantity(count())
FROM reddit
GROUP BY year;
```
```response
┌─year─┬─formatReadableQuantity(count())─┐
│ 2005 │ 1.07 thousand │
│ 2006 │ 417.18 thousand │
│ 2007 │ 2.46 million │
│ 2008 │ 7.24 million │
│ 2009 │ 18.86 million │
│ 2010 │ 42.93 million │
│ 2011 │ 28.91 million │
│ 2012 │ 260.31 million │
│ 2013 │ 402.21 million │
│ 2014 │ 531.80 million │
│ 2015 │ 667.76 million │
│ 2016 │ 799.90 million │
│ 2017 │ 972.86 million │
│ 2018 │ 1.24 billion │
│ 2019 │ 1.66 billion │
│ 2020 │ 2.16 billion │
│ 2021 │ 2.59 billion │
│ 2022 │ 2.82 billion │
│ 2023 │ 474.86 million │
└──────┴─────────────────────────────────┘
```
9. Let's see how many rows were inserted and how much disk space the table is using:
```sql ```sql
@ -227,17 +271,17 @@ SELECT
formatReadableSize(sum(bytes)) AS disk_size, formatReadableSize(sum(bytes)) AS disk_size,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size
FROM system.parts FROM system.parts
WHERE (table = 'reddit') AND active WHERE (table = 'reddit') AND active;
``` ```
Notice the compression of disk storage is about 1/3 of the uncompressed size: Notice the compression of disk storage is about 1/3 of the uncompressed size:
```response ```response
┌──────count─┬─formatReadableQuantity(sum(rows))─┬─disk_size─┬─uncompressed_size─┐ ┌──────count─┬─formatReadableQuantity(sum(rows))─┬─disk_size─┬─uncompressed_size─┐
6739503568 │ 6.74 billion │ 501.10 GiB │ 1.51 TiB │ 14688534662 │ 14.69 billion │ 1.03 TiB │ 3.26 TiB │
└────────────┴───────────────────────────────────┴───────────┴───────────────────┘ └────────────┴───────────────────────────────────┴───────────┴───────────────────┘
1 row in set. Elapsed: 0.010 sec. 1 row in set. Elapsed: 0.005 sec.
``` ```
9. The following query shows how many comments, authors and subreddits we have for each month: 9. The following query shows how many comments, authors and subreddits we have for each month:
@ -256,10 +300,10 @@ GROUP BY firstOfMonth
ORDER BY firstOfMonth ASC; ORDER BY firstOfMonth ASC;
``` ```
This is a substantial query that has to process all 6.74 billion rows, but we still get an impressive response time (about 3 minutes): This is a substantial query that has to process all 14.69 billion rows, but we still get an impressive response time (about 48 seconds):
```response ```response
┌─firstOfMonth─┬─────────c─┬─bar_count─────────────────┬─authors─┬─bar_authors───────────────┬─subreddits─┬─bar_subreddits────────────┐ ┌─firstOfMonth─┬─────────c─┬─bar_count─────────────────┬─authors─┬─bar_authors───────────────┬─subreddits─┬─bar_subreddits────────────┐
│ 2005-12-01 │ 1075 │ │ 394 │ │ 1 │ │ │ 2005-12-01 │ 1075 │ │ 394 │ │ 1 │ │
│ 2006-01-01 │ 3666 │ │ 791 │ │ 2 │ │ │ 2006-01-01 │ 3666 │ │ 791 │ │ 2 │ │
│ 2006-02-01 │ 9095 │ │ 1464 │ │ 18 │ │ │ 2006-02-01 │ 9095 │ │ 1464 │ │ 18 │ │
@ -315,24 +359,20 @@ This is a substantial query that has to process all 6.74 billion rows, but we st
│ 2010-04-01 │ 3209898 │ █▌ │ 128936 │ ▋ │ 3170 │ ▊ │ │ 2010-04-01 │ 3209898 │ █▌ │ 128936 │ ▋ │ 3170 │ ▊ │
│ 2010-05-01 │ 3267363 │ █▋ │ 131851 │ ▋ │ 3166 │ ▊ │ │ 2010-05-01 │ 3267363 │ █▋ │ 131851 │ ▋ │ 3166 │ ▊ │
│ 2010-06-01 │ 3532867 │ █▊ │ 139522 │ ▋ │ 3301 │ ▊ │ │ 2010-06-01 │ 3532867 │ █▊ │ 139522 │ ▋ │ 3301 │ ▊ │
│ 2010-07-01 │ 4032737 │ ██ │ 153451 │ ▊ │ 3662 │ ▉ │ 2010-07-01 │ 806612 │ ▍ │ 76486 │ ▍ │ 1955 │ ▍
│ 2010-08-01 │ 4247982 │ ██ │ 164071 │ ▊ │ 3653 │ ▉ │ │ 2010-08-01 │ 4247982 │ ██ │ 164071 │ ▊ │ 3653 │ ▉ │
│ 2010-09-01 │ 4704069 │ ██▎ │ 186613 │ ▉ │ 4009 │ █ │ │ 2010-09-01 │ 4704069 │ ██▎ │ 186613 │ ▉ │ 4009 │ █ │
│ 2010-10-01 │ 5032368 │ ██▌ │ 203800 │ █ │ 4154 │ █ │ │ 2010-10-01 │ 5032368 │ ██▌ │ 203800 │ █ │ 4154 │ █ │
│ 2010-11-01 │ 5689002 │ ██▊ │ 226134 │ █▏ │ 4383 │ █ │ │ 2010-11-01 │ 5689002 │ ██▊ │ 226134 │ █▏ │ 4383 │ █ │
│ 2010-12-01 │ 5972642 │ ██▉ │ 245824 │ █▏ │ 4692 │ █▏ │ │ 2010-12-01 │ 3642690 │ █▊ │ 196847 │ ▉ │ 3914 │ ▉ │
│ 2011-01-01 │ 6603329 │ ███▎ │ 270025 │ █▎ │ 5141 │ █▎ │ │ 2011-01-01 │ 3924540 │ █▉ │ 215057 │ █ │ 4240 │ █ │
│ 2011-02-01 │ 6363114 │ ███▏ │ 277593 │ █▍ │ 5202 │ █▎ │ │ 2011-02-01 │ 3859131 │ █▉ │ 223485 │ █ │ 4371 │ █ │
│ 2011-03-01 │ 7556165 │ ███▊ │ 314748 │ █▌ │ 5445 │ █▎ │ │ 2011-03-01 │ 2877996 │ █▍ │ 208607 │ █ │ 3870 │ ▉ │
│ 2011-04-01 │ 7571398 │ ███▊ │ 329920 │ █▋ │ 6128 │ █▌ │ │ 2011-04-01 │ 3859131 │ █▉ │ 248931 │ █▏ │ 4881 │ █▏ │
│ 2011-05-01 │ 8803949 │ ████▍ │ 365013 │ █▊ │ 6834 │ █▋ │ │ 2011-06-01 │ 3859131 │ █▉ │ 267197 │ █▎ │ 5255 │ █▎ │
│ 2011-06-01 │ 9766511 │ ████▉ │ 393945 │ █▉ │ 7519 │ █▉ │ │ 2011-08-01 │ 2943405 │ █▍ │ 259428 │ █▎ │ 5806 │ █▍ │
│ 2011-07-01 │ 10557466 │ █████▎ │ 424235 │ ██ │ 8293 │ ██ │ │ 2011-10-01 │ 3859131 │ █▉ │ 327342 │ █▋ │ 6958 │ █▋ │
│ 2011-08-01 │ 12316144 │ ██████▏ │ 475326 │ ██▍ │ 9657 │ ██▍ │ │ 2011-12-01 │ 3728313 │ █▊ │ 354817 │ █▊ │ 7713 │ █▉ │
│ 2011-09-01 │ 12150412 │ ██████ │ 503142 │ ██▌ │ 10278 │ ██▌ │
│ 2011-10-01 │ 13470278 │ ██████▋ │ 548801 │ ██▋ │ 10922 │ ██▋ │
│ 2011-11-01 │ 13621533 │ ██████▊ │ 574435 │ ██▊ │ 11572 │ ██▉ │
│ 2011-12-01 │ 14509469 │ ███████▎ │ 622849 │ ███ │ 12335 │ ███ │
│ 2012-01-01 │ 16350205 │ ████████▏ │ 696110 │ ███▍ │ 14281 │ ███▌ │ │ 2012-01-01 │ 16350205 │ ████████▏ │ 696110 │ ███▍ │ 14281 │ ███▌ │
│ 2012-02-01 │ 16015695 │ ████████ │ 722892 │ ███▌ │ 14949 │ ███▋ │ │ 2012-02-01 │ 16015695 │ ████████ │ 722892 │ ███▌ │ 14949 │ ███▋ │
│ 2012-03-01 │ 17881943 │ ████████▉ │ 789664 │ ███▉ │ 15795 │ ███▉ │ │ 2012-03-01 │ 17881943 │ ████████▉ │ 789664 │ ███▉ │ 15795 │ ███▉ │
@ -426,15 +466,50 @@ This is a substantial query that has to process all 6.74 billion rows, but we st
│ 2019-07-01 │ 145965083 │ █████████████████████████ │ 6901822 │ █████████████████████████ │ 147802 │ █████████████████████████ │ │ 2019-07-01 │ 145965083 │ █████████████████████████ │ 6901822 │ █████████████████████████ │ 147802 │ █████████████████████████ │
│ 2019-08-01 │ 146854393 │ █████████████████████████ │ 6993882 │ █████████████████████████ │ 151888 │ █████████████████████████ │ │ 2019-08-01 │ 146854393 │ █████████████████████████ │ 6993882 │ █████████████████████████ │ 151888 │ █████████████████████████ │
│ 2019-09-01 │ 137540219 │ █████████████████████████ │ 7001362 │ █████████████████████████ │ 148839 │ █████████████████████████ │ │ 2019-09-01 │ 137540219 │ █████████████████████████ │ 7001362 │ █████████████████████████ │ 148839 │ █████████████████████████ │
│ 2019-10-01 │ 129771456 │ █████████████████████████ │ 6825690 │ █████████████████████████ │ 144453 │ █████████████████████████ │ │ 2019-10-01 │ 145909884 │ █████████████████████████ │ 7160126 │ █████████████████████████ │ 152075 │ █████████████████████████ │
│ 2019-11-01 │ 107990259 │ █████████████████████████ │ 6368286 │ █████████████████████████ │ 141768 │ █████████████████████████ │ │ 2019-11-01 │ 138512489 │ █████████████████████████ │ 7098723 │ █████████████████████████ │ 164597 │ █████████████████████████ │
│ 2019-12-01 │ 112895934 │ █████████████████████████ │ 6640902 │ █████████████████████████ │ 148277 │ █████████████████████████ │ │ 2019-12-01 │ 146012313 │ █████████████████████████ │ 7438261 │ █████████████████████████ │ 166966 │ █████████████████████████ │
│ 2020-01-01 │ 54354879 │ █████████████████████████ │ 4782339 │ ███████████████████████▉ │ 111658 │ █████████████████████████ │ │ 2020-01-01 │ 153498208 │ █████████████████████████ │ 7703548 │ █████████████████████████ │ 174390 │ █████████████████████████ │
│ 2020-02-01 │ 22696923 │ ███████████▎ │ 3135175 │ ███████████████▋ │ 79521 │ ███████████████████▉ │ │ 2020-02-01 │ 148386817 │ █████████████████████████ │ 7582031 │ █████████████████████████ │ 170257 │ █████████████████████████ │
│ 2020-03-01 │ 3466677 │ █▋ │ 987960 │ ████▉ │ 40901 │ ██████████▏ │ │ 2020-03-01 │ 166266315 │ █████████████████████████ │ 8339049 │ █████████████████████████ │ 192460 │ █████████████████████████ │
└──────────────┴───────────┴───────────────────────────┴─────────┴───────────────────────────┴────────────┴───────────────────────────┘ │ 2020-04-01 │ 178511581 │ █████████████████████████ │ 8991649 │ █████████████████████████ │ 202334 │ █████████████████████████ │
│ 2020-05-01 │ 189993779 │ █████████████████████████ │ 9331358 │ █████████████████████████ │ 217357 │ █████████████████████████ │
│ 2020-06-01 │ 187914434 │ █████████████████████████ │ 9085003 │ █████████████████████████ │ 223362 │ █████████████████████████ │
│ 2020-07-01 │ 194244994 │ █████████████████████████ │ 9321706 │ █████████████████████████ │ 228222 │ █████████████████████████ │
│ 2020-08-01 │ 196099301 │ █████████████████████████ │ 9368408 │ █████████████████████████ │ 230251 │ █████████████████████████ │
│ 2020-09-01 │ 182549761 │ █████████████████████████ │ 9271571 │ █████████████████████████ │ 227889 │ █████████████████████████ │
│ 2020-10-01 │ 186583890 │ █████████████████████████ │ 9396112 │ █████████████████████████ │ 233715 │ █████████████████████████ │
│ 2020-11-01 │ 186083723 │ █████████████████████████ │ 9623053 │ █████████████████████████ │ 234963 │ █████████████████████████ │
│ 2020-12-01 │ 191317162 │ █████████████████████████ │ 9898168 │ █████████████████████████ │ 249115 │ █████████████████████████ │
│ 2021-01-01 │ 210496207 │ █████████████████████████ │ 10503943 │ █████████████████████████ │ 259805 │ █████████████████████████ │
│ 2021-02-01 │ 193510365 │ █████████████████████████ │ 10215033 │ █████████████████████████ │ 253656 │ █████████████████████████ │
│ 2021-03-01 │ 207454415 │ █████████████████████████ │ 10365629 │ █████████████████████████ │ 267263 │ █████████████████████████ │
│ 2021-04-01 │ 204573086 │ █████████████████████████ │ 10391984 │ █████████████████████████ │ 270543 │ █████████████████████████ │
│ 2021-05-01 │ 217655366 │ █████████████████████████ │ 10648130 │ █████████████████████████ │ 288555 │ █████████████████████████ │
│ 2021-06-01 │ 208027069 │ █████████████████████████ │ 10397311 │ █████████████████████████ │ 291520 │ █████████████████████████ │
│ 2021-07-01 │ 210955954 │ █████████████████████████ │ 10063967 │ █████████████████████████ │ 252061 │ █████████████████████████ │
│ 2021-08-01 │ 225681244 │ █████████████████████████ │ 10383556 │ █████████████████████████ │ 254569 │ █████████████████████████ │
│ 2021-09-01 │ 220086513 │ █████████████████████████ │ 10298344 │ █████████████████████████ │ 256826 │ █████████████████████████ │
│ 2021-10-01 │ 227527379 │ █████████████████████████ │ 10729882 │ █████████████████████████ │ 283328 │ █████████████████████████ │
│ 2021-11-01 │ 228289963 │ █████████████████████████ │ 10995197 │ █████████████████████████ │ 302386 │ █████████████████████████ │
│ 2021-12-01 │ 235807471 │ █████████████████████████ │ 11312798 │ █████████████████████████ │ 313876 │ █████████████████████████ │
│ 2022-01-01 │ 256766679 │ █████████████████████████ │ 12074520 │ █████████████████████████ │ 340407 │ █████████████████████████ │
│ 2022-02-01 │ 219927645 │ █████████████████████████ │ 10846045 │ █████████████████████████ │ 293236 │ █████████████████████████ │
│ 2022-03-01 │ 236554668 │ █████████████████████████ │ 11330285 │ █████████████████████████ │ 302387 │ █████████████████████████ │
│ 2022-04-01 │ 231188077 │ █████████████████████████ │ 11697995 │ █████████████████████████ │ 316303 │ █████████████████████████ │
│ 2022-05-01 │ 230492108 │ █████████████████████████ │ 11448584 │ █████████████████████████ │ 323725 │ █████████████████████████ │
│ 2022-06-01 │ 218842949 │ █████████████████████████ │ 11400399 │ █████████████████████████ │ 324846 │ █████████████████████████ │
│ 2022-07-01 │ 242504279 │ █████████████████████████ │ 12049204 │ █████████████████████████ │ 335621 │ █████████████████████████ │
│ 2022-08-01 │ 247215325 │ █████████████████████████ │ 12189276 │ █████████████████████████ │ 337873 │ █████████████████████████ │
│ 2022-09-01 │ 234131223 │ █████████████████████████ │ 11674079 │ █████████████████████████ │ 326325 │ █████████████████████████ │
│ 2022-10-01 │ 237365072 │ █████████████████████████ │ 11804508 │ █████████████████████████ │ 336063 │ █████████████████████████ │
│ 2022-11-01 │ 229478878 │ █████████████████████████ │ 11543020 │ █████████████████████████ │ 323122 │ █████████████████████████ │
│ 2022-12-01 │ 238862690 │ █████████████████████████ │ 11967451 │ █████████████████████████ │ 331668 │ █████████████████████████ │
│ 2023-01-01 │ 253577512 │ █████████████████████████ │ 12264087 │ █████████████████████████ │ 332711 │ █████████████████████████ │
│ 2023-02-01 │ 221285501 │ █████████████████████████ │ 11537091 │ █████████████████████████ │ 317879 │ █████████████████████████ │
└──────────────┴───────────┴───────────────────────────┴──────────┴───────────────────────────┴────────────┴───────────────────────────┘
172 rows in set. Elapsed: 184.809 sec. Processed 6.74 billion rows, 89.56 GB (36.47 million rows/s., 484.62 MB/s.) 203 rows in set. Elapsed: 48.492 sec. Processed 14.69 billion rows, 213.35 GB (302.91 million rows/s., 4.40 GB/s.)
``` ```
10. Here are the top 10 subreddits of 2022: 10. Here are the top 10 subreddits of 2022:
@ -450,23 +525,21 @@ ORDER BY count DESC
LIMIT 10; LIMIT 10;
``` ```
The response is:
```response ```response
┌─subreddit────────┬───count─┐ ┌─subreddit──────┬────count─┐
│ AskReddit │ 3858203 │ AskReddit │ 72312060
politics │ 1356782 AmItheAsshole │ 25323210
memes │ 1249120 │ teenagers │ 22355960 │
nfl │ 883667 │ worldnews │ 17797707 │
worldnews │ 866065 FreeKarma4U │ 15652274
teenagers │ 777095 │ FreeKarma4You │ 14929055 │
AmItheAsshole │ 752720 wallstreetbets │ 14235271
dankmemes │ 657932 politics │ 12511136
nba │ 514184 memes │ 11610792
unpopularopinion │ 473649 nba │ 11586571
└──────────────────┴─────────┘ └────────────────┴──────────┘
10 rows in set. Elapsed: 27.824 sec. Processed 6.74 billion rows, 53.26 GB (242.22 million rows/s., 1.91 GB/s.) 10 rows in set. Elapsed: 5.956 sec. Processed 14.69 billion rows, 126.19 GB (2.47 billion rows/s., 21.19 GB/s.)
``` ```
11. Let's see which subreddits had the biggest increase in commnents from 2018 to 2019: 11. Let's see which subreddits had the biggest increase in commnents from 2018 to 2019:
@ -502,62 +575,62 @@ It looks like memes and teenagers were busy on Reddit in 2019:
```response ```response
┌─subreddit────────────┬─────diff─┐ ┌─subreddit────────────┬─────diff─┐
memes │ 15368369 │ AskReddit │ 18765909 │
AskReddit │ 14663662 memes │ 16496996
│ teenagers │ 12266991 │ teenagers │ 13071715
│ AmItheAsshole │ 11561538 │ AmItheAsshole │ 12312663
│ dankmemes │ 11305158 │ dankmemes │ 12016716
│ unpopularopinion │ 6332772 │ unpopularopinion │ 6809935
│ PewdiepieSubmissions │ 5930818 │ PewdiepieSubmissions │ 6330844
│ Market76 │ 5014668 │ Market76 │ 5213690
│ relationship_advice │ 3776383 │ relationship_advice │ 4060717
freefolk │ 3169236 Minecraft │ 3328659
Minecraft │ 3160241 freefolk │ 3227970
│ classicwow │ 2907056 │ classicwow │ 3063133
│ Animemes │ 2673398 │ Animemes │ 2866876
│ gameofthrones │ 2402835 │ gonewild │ 2457680
│ PublicFreakout │ 2267605 │ PublicFreakout │ 2452288
ShitPostCrusaders │ 2207266 gameofthrones │ 2411661
│ RoastMe │ 2195715 │ RoastMe │ 2378781
gonewild │ 2148649 ShitPostCrusaders │ 2345414
│ AnthemTheGame │ 1803818 │ AnthemTheGame │ 1813152
entitledparents │ 1706270 nfl │ 1804407
MortalKombat │ 1679508 │ Showerthoughts │ 1797968 │
│ Cringetopia │ 1620555 │ Cringetopia │ 1764034
│ pokemon │ 1615266 │ pokemon │ 1763269
HistoryMemes │ 1608289 entitledparents │ 1744852
Brawlstars │ 1574977 HistoryMemes │ 1721645
iamatotalpieceofshit │ 1558315 MortalKombat │ 1718184
│ trashy │ 1518549 │ trashy │ 1684357
│ ChapoTrapHouse │ 1505748 │ ChapoTrapHouse │ 1675363
Pikabu │ 1501001 Brawlstars │ 1663763
Showerthoughts │ 1475101 │ iamatotalpieceofshit │ 1647381 │
cursedcomments │ 1465607 ukpolitics │ 1599204
ukpolitics │ 1386043 cursedcomments │ 1590781
wallstreetbets │ 1384431 Pikabu │ 1578597
interestingasfuck │ 1378900 wallstreetbets │ 1535225
wholesomememes │ 1353333 AskOuija │ 1533214
AskOuija │ 1233263 interestingasfuck │ 1528910
borderlands3 │ 1197192 aww │ 1439008
aww │ 1168257 wholesomememes │ 1436566
insanepeoplefacebook │ 1155473 SquaredCircle │ 1432172
FortniteCompetitive │ 1122778 insanepeoplefacebook │ 1290686
EpicSeven │ 1117380 borderlands3 │ 1274462
│ FreeKarma4U │ 1116423 │ FreeKarma4U │ 1217769
│ YangForPresidentHQ │ 1086700 │ YangForPresidentHQ │ 1186918
SquaredCircle │ 1044089 FortniteCompetitive │ 1184508
MurderedByWords │ 1042511 AskMen │ 1180820
AskMen │ 1024434 EpicSeven │ 1172061
thedivision │ 1016634 MurderedByWords │ 1112476
barstoolsports │ 985032 politics │ 1084087
nfl │ 978340 │ barstoolsports │ 1068020 │
│ BattlefieldV │ 971408 │ │ BattlefieldV │ 1053878 │
└──────────────────────┴──────────┘ └──────────────────────┴──────────┘
50 rows in set. Elapsed: 65.954 sec. Processed 13.48 billion rows, 79.67 GB (204.37 million rows/s., 1.21 GB/s.) 50 rows in set. Elapsed: 10.680 sec. Processed 29.38 billion rows, 198.67 GB (2.75 billion rows/s., 18.60 GB/s.)
``` ```
12. One more query: let's compare ClickHouse mentions to other technologies like Snowflake and Postgres. This query is a big one because it has to search all the comments three times for a substring, and unfortunately ClickHouse user are obviously not very active on Reddit yet: 12. One more query: let's compare ClickHouse mentions to other technologies like Snowflake and Postgres. This query is a big one because it has to search all 14.69 billion comments three times for a substring, but the performance is actually quite impressive. (Unfortunately ClickHouse users are not very active on Reddit yet):
```sql ```sql
SELECT SELECT
@ -571,7 +644,7 @@ ORDER BY quarter ASC;
``` ```
```response ```response
┌────Quarter─┬─clickhouse─┬─snowflake─┬─postgres─┐ ┌────quarter─┬─clickhouse─┬─snowflake─┬─postgres─┐
│ 2005-10-01 │ 0 │ 0 │ 0 │ │ 2005-10-01 │ 0 │ 0 │ 0 │
│ 2006-01-01 │ 0 │ 2 │ 23 │ │ 2006-01-01 │ 0 │ 2 │ 23 │
│ 2006-04-01 │ 0 │ 2 │ 24 │ │ 2006-04-01 │ 0 │ 2 │ 24 │
@ -591,12 +664,12 @@ ORDER BY quarter ASC;
│ 2009-10-01 │ 0 │ 633 │ 589 │ │ 2009-10-01 │ 0 │ 633 │ 589 │
│ 2010-01-01 │ 0 │ 555 │ 501 │ │ 2010-01-01 │ 0 │ 555 │ 501 │
│ 2010-04-01 │ 0 │ 587 │ 469 │ │ 2010-04-01 │ 0 │ 587 │ 469 │
│ 2010-07-01 │ 0 │ 770 │ 821 │ 2010-07-01 │ 0 │ 601 │ 696
│ 2010-10-01 │ 0 │ 1480 │ 550 │ 2010-10-01 │ 0 │ 1246 │ 505
│ 2011-01-01 │ 0 │ 1482 │ 568 │ 2011-01-01 │ 0 │ 758 │ 247
│ 2011-04-01 │ 0 │ 1558 │ 406 │ 2011-04-01 │ 0 │ 537 │ 113
│ 2011-07-01 │ 0 │ 2163 │ 628 │ 2011-07-01 │ 0 │ 173 │ 64
│ 2011-10-01 │ 0 │ 4064 │ 566 │ │ 2011-10-01 │ 0 │ 649 │ 96 │
│ 2012-01-01 │ 0 │ 4621 │ 662 │ │ 2012-01-01 │ 0 │ 4621 │ 662 │
│ 2012-04-01 │ 0 │ 5737 │ 785 │ │ 2012-04-01 │ 0 │ 5737 │ 785 │
│ 2012-07-01 │ 0 │ 6097 │ 1127 │ │ 2012-07-01 │ 0 │ 6097 │ 1127 │
@ -628,9 +701,21 @@ ORDER BY quarter ASC;
│ 2019-01-01 │ 14 │ 80250 │ 4305 │ │ 2019-01-01 │ 14 │ 80250 │ 4305 │
│ 2019-04-01 │ 30 │ 70307 │ 3872 │ │ 2019-04-01 │ 30 │ 70307 │ 3872 │
│ 2019-07-01 │ 33 │ 77149 │ 4164 │ │ 2019-07-01 │ 33 │ 77149 │ 4164 │
│ 2019-10-01 │ 13 │ 76746 │ 3541 │ │ 2019-10-01 │ 22 │ 113011 │ 4369 │
│ 2020-01-01 │ 16 │ 54475 │ 846 │ │ 2020-01-01 │ 34 │ 238273 │ 5133 │
│ 2020-04-01 │ 52 │ 454467 │ 6100 │
│ 2020-07-01 │ 37 │ 406623 │ 5507 │
│ 2020-10-01 │ 49 │ 212143 │ 5385 │
│ 2021-01-01 │ 56 │ 151262 │ 5749 │
│ 2021-04-01 │ 71 │ 119928 │ 6039 │
│ 2021-07-01 │ 53 │ 110342 │ 5765 │
│ 2021-10-01 │ 92 │ 121144 │ 6401 │
│ 2022-01-01 │ 93 │ 107512 │ 6772 │
│ 2022-04-01 │ 120 │ 91560 │ 6687 │
│ 2022-07-01 │ 183 │ 99764 │ 7377 │
│ 2022-10-01 │ 123 │ 99447 │ 7052 │
│ 2023-01-01 │ 126 │ 58733 │ 4891 │
└────────────┴────────────┴───────────┴──────────┘ └────────────┴────────────┴───────────┴──────────┘
58 rows in set. Elapsed: 2663.751 sec. Processed 6.74 billion rows, 1.21 TB (2.53 million rows/s., 454.37 MB/s.) 70 rows in set. Elapsed: 325.835 sec. Processed 14.69 billion rows, 2.57 TB (45.08 million rows/s., 7.87 GB/s.)
``` ```