ClickHouse/docs/en/getting-started/example-datasets/youtube-dislikes.md

Ignoring revisions in .git-blame-ignore-revs. Click here to bypass and see the normal blame view.

488 lines
21 KiB
Markdown
Raw Normal View History

2023-03-26 03:21:06 +00:00
---
slug: /en/getting-started/example-datasets/youtube-dislikes
sidebar_label: YouTube Dislikes
description: A collection is dislikes of YouTube videos.
---
# YouTube dataset of dislikes
In November of 2021, YouTube removed the public ***dislike*** count from all of its videos. While creators can still see the number of dislikes, viewers can only see how many ***likes*** a video has received.
:::important
2023-03-26 22:11:05 +00:00
The dataset has over 4.55 billion records, so be careful just copying-and-pasting the commands below unless your resources can handle that type of volume. The commands below were executed on a **Production** instance of [ClickHouse Cloud](https://clickhouse.cloud).
2023-03-26 03:21:06 +00:00
:::
2023-03-26 22:11:05 +00:00
The data is in a JSON format and can be downloaded from [archive.org](https://archive.org/download/dislikes_youtube_2021_12_video_json_files). We have made this same data available in S3 so that it can be downloaded more efficiently into a ClickHouse Cloud instance.
2023-03-26 03:21:06 +00:00
Here are the steps to create a table in ClickHouse Cloud and insert the data.
:::note
The steps below will easily work on a local install of ClickHouse too. The only change would be to use the `s3` function instead of `s3cluster` (unless you have a cluster configured - in which case change `default` to the name of your cluster).
:::
2023-03-26 04:06:45 +00:00
## Step-by-step instructions
2023-06-02 11:30:05 +00:00
1. Let's see what the data looks like. The `s3cluster` table function returns a table, so we can `DESCRIBE` the result:
2023-03-26 03:21:06 +00:00
```sql
2023-03-26 22:11:05 +00:00
DESCRIBE s3Cluster(
'default',
'https://clickhouse-public-datasets.s3.amazonaws.com/youtube/original/files/*.zst',
'JSONLines'
);
2023-03-26 03:21:06 +00:00
```
ClickHouse infers the following schema from the JSON file:
```response
┌─name────────────────┬─type─────────────────────────────────┐
│ id │ Nullable(String) │
│ fetch_date │ Nullable(Int64) │
│ upload_date │ Nullable(String) │
│ title │ Nullable(String) │
│ uploader_id │ Nullable(String) │
│ uploader │ Nullable(String) │
│ uploader_sub_count │ Nullable(Int64) │
│ is_age_limit │ Nullable(Bool) │
│ view_count │ Nullable(Int64) │
│ like_count │ Nullable(Int64) │
│ dislike_count │ Nullable(Int64) │
│ is_crawlable │ Nullable(Bool) │
│ is_live_content │ Nullable(Bool) │
│ has_subtitles │ Nullable(Bool) │
│ is_ads_enabled │ Nullable(Bool) │
│ is_comments_enabled │ Nullable(Bool) │
│ description │ Nullable(String) │
│ rich_metadata │ Array(Map(String, Nullable(String))) │
│ super_titles │ Array(Map(String, Nullable(String))) │
│ uploader_badges │ Nullable(String) │
│ video_badges │ Nullable(String) │
└─────────────────────┴──────────────────────────────────────┘
```
2. Based on the inferred schema, we cleaned up the data types and added a primary key. Define the following table:
```sql
CREATE TABLE youtube
(
`id` String,
`fetch_date` DateTime,
2023-03-31 16:32:38 +00:00
`upload_date_str` String,
`upload_date` Date,
2023-03-26 03:21:06 +00:00
`title` String,
`uploader_id` String,
`uploader` String,
`uploader_sub_count` Int64,
`is_age_limit` Bool,
`view_count` Int64,
`like_count` Int64,
`dislike_count` Int64,
`is_crawlable` Bool,
`has_subtitles` Bool,
`is_ads_enabled` Bool,
`is_comments_enabled` Bool,
`description` String,
`rich_metadata` Array(Map(String, String)),
`super_titles` Array(Map(String, String)),
`uploader_badges` String,
`video_badges` String
)
ENGINE = MergeTree
2023-03-31 16:32:38 +00:00
ORDER BY (uploader, upload_date);
2023-03-26 03:21:06 +00:00
```
2023-03-26 04:06:45 +00:00
3. The following command streams the records from the S3 files into the `youtube` table.
:::important
This inserts a lot of data - 4.65 billion rows. If you do not want the entire dataset, simply add a `LIMIT` clause with the desired number of rows.
:::
2023-03-26 03:21:06 +00:00
```sql
INSERT INTO youtube
SETTINGS input_format_null_as_default = 1
SELECT
id,
2023-03-31 16:32:38 +00:00
parseDateTimeBestEffortUSOrZero(toString(fetch_date)) AS fetch_date,
upload_date AS upload_date_str,
2023-03-31 18:41:22 +00:00
toDate(parseDateTimeBestEffortUSOrZero(upload_date::String)) AS upload_date,
2023-03-26 03:21:06 +00:00
ifNull(title, '') AS title,
uploader_id,
ifNull(uploader, '') AS uploader,
uploader_sub_count,
is_age_limit,
view_count,
like_count,
dislike_count,
is_crawlable,
has_subtitles,
is_ads_enabled,
is_comments_enabled,
ifNull(description, '') AS description,
rich_metadata,
super_titles,
ifNull(uploader_badges, '') AS uploader_badges,
ifNull(video_badges, '') AS video_badges
2023-04-14 20:32:55 +00:00
FROM s3(
2023-03-31 16:32:38 +00:00
'https://clickhouse-public-datasets.s3.amazonaws.com/youtube/original/files/*.zst',
'JSONLines'
)
2023-03-26 03:21:06 +00:00
```
2023-03-31 16:32:38 +00:00
Some comments about our `INSERT` command:
- The `parseDateTimeBestEffortUSOrZero` function is handy when the incoming date fields may not be in the proper format. If `fetch_date` does not get parsed properly, it will be set to `0`
2023-03-31 18:41:22 +00:00
- The `upload_date` column contains valid dates, but it also contains strings like "4 hours ago" - which is certainly not a valid date. We decided to store the original value in `upload_date_str` and attempt to parse it with `toDate(parseDateTimeBestEffortUSOrZero(upload_date::String))`. If the parsing fails we just get `0`
- We used `ifNull` to avoid getting `NULL` values in our table. If an incoming value is `NULL`, the `ifNull` function is setting the value to an empty string
2023-03-31 16:32:38 +00:00
2023-03-31 18:41:22 +00:00
4. Open a new tab in the SQL Console of ClickHouse Cloud (or a new `clickhouse-client` window) and watch the count increase. It will take a while to insert 4.56B rows, depending on your server resources. (Without any tweaking of settings, it takes about 4.5 hours.)
2023-03-26 03:47:55 +00:00
```sql
2023-03-26 22:11:05 +00:00
SELECT formatReadableQuantity(count())
FROM youtube
2023-03-26 03:47:55 +00:00
```
2023-03-26 22:11:05 +00:00
```response
┌─formatReadableQuantity(count())─┐
│ 4.56 billion │
└─────────────────────────────────┘
```
5. Once the data is inserted, go ahead and count the number of dislikes of your favorite videos or channels. Let's see how many videos were uploaded by ClickHouse:
2023-03-26 03:47:55 +00:00
```sql
2023-03-26 22:11:05 +00:00
SELECT count()
2023-03-26 03:47:55 +00:00
FROM youtube
2023-03-26 22:11:05 +00:00
WHERE uploader = 'ClickHouse';
2023-03-26 03:47:55 +00:00
```
2023-03-26 22:11:05 +00:00
```response
┌─count()─┐
│ 84 │
└─────────┘
1 row in set. Elapsed: 0.570 sec. Processed 237.57 thousand rows, 5.77 MB (416.54 thousand rows/s., 10.12 MB/s.)
```
:::note
The query above runs so quickly because we chose `uploader` as the first column of the primary key - so it only had to process 237k rows.
:::
2023-03-26 22:17:07 +00:00
6. Let's look and likes and dislikes of ClickHouse videos:
```sql
SELECT
title,
like_count,
dislike_count
FROM youtube
WHERE uploader = 'ClickHouse'
ORDER BY dislike_count DESC;
```
The response looks like:
```response
┌─title────────────────────────────────────────────────────────────────────────────────────────────────┬─like_count─┬─dislike_count─┐
│ ClickHouse v21.11 Release Webinar │ 52 │ 3 │
│ ClickHouse Introduction │ 97 │ 3 │
│ Casa Modelo Algarve │ 180 │ 3 │
│ Профайлер запросов: трудный путь │ 33 │ 3 │
│ ClickHouse в Курсометре │ 4 │ 2 │
│ 10 Good Reasons to Use ClickHouse │ 27 │ 2 │
...
84 rows in set. Elapsed: 0.013 sec. Processed 155.65 thousand rows, 16.94 MB (11.96 million rows/s., 1.30 GB/s.)
```
7. Here is a search for videos with **ClickHouse** in the `title` or `description` fields:
2023-03-26 03:47:55 +00:00
```sql
SELECT
view_count,
like_count,
dislike_count,
concat('https://youtu.be/', id) AS url,
title
FROM youtube
WHERE (title ILIKE '%ClickHouse%') OR (description ILIKE '%ClickHouse%')
ORDER BY
like_count DESC,
2023-03-31 22:03:44 +00:00
view_count DESC;
2023-03-26 22:11:05 +00:00
```
This query has to process every row, and also parse through two columns of strings. Even then, we get decent performance at 4.15M rows/second:
```response
1174 rows in set. Elapsed: 1099.368 sec. Processed 4.56 billion rows, 1.98 TB (4.15 million rows/s., 1.80 GB/s.)
2023-03-26 03:47:55 +00:00
```
2023-03-26 22:11:05 +00:00
The results look like:
```response
┌─view_count─┬─like_count─┬─dislike_count─┬─url──────────────────────────┬─title──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 1919 │ 63 │ 1 │ https://youtu.be/b9MeoOtAivQ │ ClickHouse v21.10 Release Webinar │
│ 8710 │ 62 │ 4 │ https://youtu.be/PeV1mC2z--M │ What is JDBC DriverManager? | JDBC │
│ 3534 │ 62 │ 1 │ https://youtu.be/8nWRhK9gw10 │ CLICKHOUSE - Arquitetura Modular │
2023-03-29 14:29:22 +00:00
```
## Questions
### If someone disables comments does it lower the chance someone will actually click like or dislike?
When commenting is disabled, are people more likely to like or dislike to express their feelings about a video?
```sql
SELECT
concat('< ', formatReadableQuantity(view_range)) AS views,
is_comments_enabled,
total_clicks / num_views AS prob_like_dislike
FROM
(
SELECT
is_comments_enabled,
power(10, CEILING(log10(view_count + 1))) AS view_range,
sum(like_count + dislike_count) AS total_clicks,
sum(view_count) AS num_views
FROM youtube
GROUP BY
view_range,
is_comments_enabled
) WHERE view_range > 1
ORDER BY
is_comments_enabled ASC,
2023-03-30 22:41:29 +00:00
num_views ASC;
2023-03-29 14:29:22 +00:00
```
```response
┌─views─────────────┬─is_comments_enabled─┬────prob_like_dislike─┐
< 10.00 false 0.08224180712685371
< 100.00 false 0.06346337759167248
< 1.00 thousand false 0.03201883652987105
< 10.00 thousand false 0.01716073540410903
< 10.00 billion false 0.004555639481829971
< 100.00 thousand false 0.01293351460515323
< 1.00 billion false 0.004761811192464957
< 1.00 million false 0.010472604018980551
< 10.00 million false 0.00788902538420125
< 100.00 million false 0.00579152804250582
< 10.00 true 0.09819517478134059
< 100.00 true 0.07403784478585775
< 1.00 thousand true 0.03846294910067627
< 10.00 billion true 0.005615217329358215
< 10.00 thousand true 0.02505881391701455
< 1.00 billion true 0.007434998802482997
< 100.00 thousand true 0.022694648130822004
< 100.00 million true 0.011761563746575625
< 1.00 million true 0.020776022304589435
< 10.00 million true 0.016917095718089584
└───────────────────┴─────────────────────┴──────────────────────┘
22 rows in set. Elapsed: 8.460 sec. Processed 4.56 billion rows, 77.48 GB (538.73 million rows/s., 9.16 GB/s.)
```
Enabling comments seems to be correlated with a higher rate of engagement.
2023-03-31 16:32:38 +00:00
### How does the number of videos change over time - notable events?
```sql
SELECT
toStartOfMonth(toDateTime(upload_date)) AS month,
uniq(uploader_id) AS uploaders,
count() as num_videos,
sum(view_count) as view_count
FROM youtube
GROUP BY month
2023-03-31 18:41:22 +00:00
ORDER BY month ASC;
2023-03-31 16:32:38 +00:00
```
```response
┌──────month─┬─uploaders─┬─num_videos─┬───view_count─┐
│ 2005-04-01 │ 5 │ 6 │ 213597737 │
│ 2005-05-01 │ 6 │ 9 │ 2944005 │
│ 2005-06-01 │ 165 │ 351 │ 18624981 │
│ 2005-07-01 │ 395 │ 1168 │ 94164872 │
│ 2005-08-01 │ 1171 │ 3128 │ 124540774 │
│ 2005-09-01 │ 2418 │ 5206 │ 475536249 │
│ 2005-10-01 │ 6750 │ 13747 │ 737593613 │
│ 2005-11-01 │ 13706 │ 28078 │ 1896116976 │
│ 2005-12-01 │ 24756 │ 49885 │ 2478418930 │
│ 2006-01-01 │ 49992 │ 100447 │ 4532656581 │
│ 2006-02-01 │ 67882 │ 138485 │ 5677516317 │
│ 2006-03-01 │ 103358 │ 212237 │ 8430301366 │
│ 2006-04-01 │ 114615 │ 234174 │ 9980760440 │
│ 2006-05-01 │ 152682 │ 332076 │ 14129117212 │
│ 2006-06-01 │ 193962 │ 429538 │ 17014143263 │
│ 2006-07-01 │ 234401 │ 530311 │ 18721143410 │
│ 2006-08-01 │ 281280 │ 614128 │ 20473502342 │
│ 2006-09-01 │ 312434 │ 679906 │ 23158422265 │
│ 2006-10-01 │ 404873 │ 897590 │ 27357846117 │
```
A spike of uploaders [around covid is noticeable](https://www.theverge.com/2020/3/27/21197642/youtube-with-me-style-videos-views-coronavirus-cook-workout-study-home-beauty).
2023-06-02 11:30:05 +00:00
### More subtitles over time and when
2023-03-31 16:32:38 +00:00
With advances in speech recognition, its easier than ever to create subtitles for video with youtube adding auto-captioning in late 2009 - was the jump then?
```sql
SELECT
toStartOfMonth(upload_date) AS month,
countIf(has_subtitles) / count() AS percent_subtitles,
2023-03-31 18:41:22 +00:00
percent_subtitles - any(percent_subtitles) OVER (
ORDER BY month ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
) AS previous
2023-03-31 16:32:38 +00:00
FROM youtube
GROUP BY month
2023-03-31 18:41:22 +00:00
ORDER BY month ASC;
2023-03-31 16:32:38 +00:00
```
```response
┌──────month─┬───percent_subtitles─┬────────────────previous─┐
│ 2015-01-01 │ 0.2652653881082824 │ 0.2652653881082824 │
│ 2015-02-01 │ 0.3147556050309162 │ 0.049490216922633834 │
│ 2015-03-01 │ 0.32460464492371877 │ 0.009849039892802558 │
│ 2015-04-01 │ 0.33471963051468445 │ 0.010114985590965686 │
│ 2015-05-01 │ 0.3168087575501062 │ -0.017910872964578273 │
│ 2015-06-01 │ 0.3162609788438222 │ -0.0005477787062839745 │
│ 2015-07-01 │ 0.31828767677518033 │ 0.0020266979313581235 │
│ 2015-08-01 │ 0.3045551564286859 │ -0.013732520346494415 │
│ 2015-09-01 │ 0.311221133995152 │ 0.006665977566466086 │
│ 2015-10-01 │ 0.30574870926812175 │ -0.005472424727030245 │
│ 2015-11-01 │ 0.31125409712077234 │ 0.0055053878526505895 │
│ 2015-12-01 │ 0.3190967954651779 │ 0.007842698344405541 │
│ 2016-01-01 │ 0.32636021432496176 │ 0.007263418859783877 │
```
The data results show a spike in 2009. Apparently at that, time YouTube was removing their community captions feature, which allowed you to upload captions for other people's video.
This prompted a very successful campaign to have creators add captions to their videos for hard of hearing and deaf viewers.
### Top uploaders over time
```sql
WITH uploaders AS
(
SELECT uploader
FROM youtube
GROUP BY uploader
ORDER BY sum(view_count) DESC
LIMIT 10
)
SELECT
month,
uploader,
sum(view_count) AS total_views,
avg(dislike_count / like_count) AS like_to_dislike_ratio
FROM youtube
WHERE uploader IN (uploaders)
GROUP BY
toStartOfMonth(upload_date) AS month,
uploader
ORDER BY
month ASC,
2023-03-31 18:41:22 +00:00
total_views DESC;
2023-03-31 16:32:38 +00:00
```
```response
┌──────month─┬─uploader───────────────────┬─total_views─┬─like_to_dislike_ratio─┐
│ 1970-01-01 │ T-Series │ 10957099 │ 0.022784656361208206 │
│ 1970-01-01 │ Ryan's World │ 0 │ 0.003035559410234172 │
│ 1970-01-01 │ SET India │ 0 │ nan │
│ 2006-09-01 │ Cocomelon - Nursery Rhymes │ 256406497 │ 0.7005566715978622 │
│ 2007-06-01 │ Cocomelon - Nursery Rhymes │ 33641320 │ 0.7088650914344298 │
│ 2008-02-01 │ WWE │ 43733469 │ 0.07198856488734842 │
│ 2008-03-01 │ WWE │ 16514541 │ 0.1230603715431997 │
│ 2008-04-01 │ WWE │ 5907295 │ 0.2089399470159618 │
│ 2008-05-01 │ WWE │ 7779627 │ 0.09101676560436774 │
│ 2008-06-01 │ WWE │ 7018780 │ 0.0974184753155297 │
│ 2008-07-01 │ WWE │ 4686447 │ 0.1263845422065158 │
│ 2008-08-01 │ WWE │ 4514312 │ 0.08384574274791441 │
│ 2008-09-01 │ WWE │ 3717092 │ 0.07872802579349912 │
```
2023-03-29 14:29:22 +00:00
### How do like ratio changes as views go up?
```sql
SELECT
concat('< ', formatReadableQuantity(view_range)) AS view_range,
is_comments_enabled,
round(like_ratio, 2) AS like_ratio
FROM
(
SELECT
power(10, CEILING(log10(view_count + 1))) as view_range,
is_comments_enabled,
avg(like_count / dislike_count) as like_ratio
FROM youtube WHERE dislike_count > 0
GROUP BY
view_range,
is_comments_enabled HAVING view_range > 1
ORDER BY
view_range ASC,
is_comments_enabled ASC
2023-03-30 22:41:29 +00:00
);
2023-03-29 14:29:22 +00:00
```
```response
┌─view_range────────┬─is_comments_enabled─┬─like_ratio─┐
< 10.00 false 0.66
< 10.00 true 0.66
< 100.00 false 3
< 100.00 true 3.95
< 1.00 thousand false 8.45
< 1.00 thousand true 13.07
< 10.00 thousand false 18.57
< 10.00 thousand true 30.92
< 100.00 thousand false 23.55
< 100.00 thousand true 42.13
< 1.00 million false 19.23
< 1.00 million true 37.86
< 10.00 million false 12.13
< 10.00 million true 30.72
< 100.00 million false 6.67
< 100.00 million true 23.32
< 1.00 billion false 3.08
< 1.00 billion true 20.69
< 10.00 billion false 1.77
< 10.00 billion true 19.5
└───────────────────┴─────────────────────┴────────────┘
```
### How are views distributed?
```sql
SELECT
labels AS percentile,
round(quantiles) AS views
FROM
(
SELECT
quantiles(0.999, 0.99, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)(view_count) AS quantiles,
['99.9th', '99th', '95th', '90th', '80th', '70th','60th', '50th', '40th', '30th', '20th', '10th'] AS labels
FROM youtube
)
ARRAY JOIN
quantiles,
2023-03-30 22:41:29 +00:00
labels;
2023-03-29 14:29:22 +00:00
```
```response
┌─percentile─┬───views─┐
│ 99.9th │ 1216624 │
│ 99th │ 143519 │
│ 95th │ 13542 │
│ 90th │ 4054 │
│ 80th │ 950 │
│ 70th │ 363 │
│ 60th │ 177 │
│ 50th │ 97 │
│ 40th │ 57 │
│ 30th │ 32 │
│ 20th │ 16 │
│ 10th │ 6 │
└────────────┴─────────┘
2023-06-02 11:30:05 +00:00
```