mirror of
https://github.com/ClickHouse/ClickHouse.git
synced 2024-11-24 00:22:29 +00:00
488 lines
21 KiB
Markdown
488 lines
21 KiB
Markdown
---
|
||
slug: /en/getting-started/example-datasets/youtube-dislikes
|
||
sidebar_label: YouTube Dislikes
|
||
description: A collection is dislikes of YouTube videos.
|
||
---
|
||
|
||
# YouTube dataset of dislikes
|
||
|
||
In November of 2021, YouTube removed the public ***dislike*** count from all of its videos. While creators can still see the number of dislikes, viewers can only see how many ***likes*** a video has received.
|
||
|
||
:::important
|
||
The dataset has over 4.55 billion records, so be careful just copying-and-pasting the commands below unless your resources can handle that type of volume. The commands below were executed on a **Production** instance of [ClickHouse Cloud](https://clickhouse.cloud).
|
||
:::
|
||
|
||
The data is in a JSON format and can be downloaded from [archive.org](https://archive.org/download/dislikes_youtube_2021_12_video_json_files). We have made this same data available in S3 so that it can be downloaded more efficiently into a ClickHouse Cloud instance.
|
||
|
||
Here are the steps to create a table in ClickHouse Cloud and insert the data.
|
||
|
||
:::note
|
||
The steps below will easily work on a local install of ClickHouse too. The only change would be to use the `s3` function instead of `s3cluster` (unless you have a cluster configured - in which case change `default` to the name of your cluster).
|
||
:::
|
||
|
||
## Step-by-step instructions
|
||
|
||
1. Let's see what the data looks like. The `s3cluster` table function returns a table, so we can `DESCRIBE` the result:
|
||
|
||
```sql
|
||
DESCRIBE s3Cluster(
|
||
'default',
|
||
'https://clickhouse-public-datasets.s3.amazonaws.com/youtube/original/files/*.zst',
|
||
'JSONLines'
|
||
);
|
||
```
|
||
|
||
ClickHouse infers the following schema from the JSON file:
|
||
|
||
```response
|
||
┌─name────────────────┬─type─────────────────────────────────┐
|
||
│ id │ Nullable(String) │
|
||
│ fetch_date │ Nullable(Int64) │
|
||
│ upload_date │ Nullable(String) │
|
||
│ title │ Nullable(String) │
|
||
│ uploader_id │ Nullable(String) │
|
||
│ uploader │ Nullable(String) │
|
||
│ uploader_sub_count │ Nullable(Int64) │
|
||
│ is_age_limit │ Nullable(Bool) │
|
||
│ view_count │ Nullable(Int64) │
|
||
│ like_count │ Nullable(Int64) │
|
||
│ dislike_count │ Nullable(Int64) │
|
||
│ is_crawlable │ Nullable(Bool) │
|
||
│ is_live_content │ Nullable(Bool) │
|
||
│ has_subtitles │ Nullable(Bool) │
|
||
│ is_ads_enabled │ Nullable(Bool) │
|
||
│ is_comments_enabled │ Nullable(Bool) │
|
||
│ description │ Nullable(String) │
|
||
│ rich_metadata │ Array(Map(String, Nullable(String))) │
|
||
│ super_titles │ Array(Map(String, Nullable(String))) │
|
||
│ uploader_badges │ Nullable(String) │
|
||
│ video_badges │ Nullable(String) │
|
||
└─────────────────────┴──────────────────────────────────────┘
|
||
```
|
||
|
||
2. Based on the inferred schema, we cleaned up the data types and added a primary key. Define the following table:
|
||
|
||
```sql
|
||
CREATE TABLE youtube
|
||
(
|
||
`id` String,
|
||
`fetch_date` DateTime,
|
||
`upload_date_str` String,
|
||
`upload_date` Date,
|
||
`title` String,
|
||
`uploader_id` String,
|
||
`uploader` String,
|
||
`uploader_sub_count` Int64,
|
||
`is_age_limit` Bool,
|
||
`view_count` Int64,
|
||
`like_count` Int64,
|
||
`dislike_count` Int64,
|
||
`is_crawlable` Bool,
|
||
`has_subtitles` Bool,
|
||
`is_ads_enabled` Bool,
|
||
`is_comments_enabled` Bool,
|
||
`description` String,
|
||
`rich_metadata` Array(Map(String, String)),
|
||
`super_titles` Array(Map(String, String)),
|
||
`uploader_badges` String,
|
||
`video_badges` String
|
||
)
|
||
ENGINE = MergeTree
|
||
ORDER BY (uploader, upload_date);
|
||
```
|
||
|
||
3. The following command streams the records from the S3 files into the `youtube` table.
|
||
|
||
:::important
|
||
This inserts a lot of data - 4.65 billion rows. If you do not want the entire dataset, simply add a `LIMIT` clause with the desired number of rows.
|
||
:::
|
||
|
||
```sql
|
||
INSERT INTO youtube
|
||
SETTINGS input_format_null_as_default = 1
|
||
SELECT
|
||
id,
|
||
parseDateTimeBestEffortUSOrZero(toString(fetch_date)) AS fetch_date,
|
||
upload_date AS upload_date_str,
|
||
toDate(parseDateTimeBestEffortUSOrZero(upload_date::String)) AS upload_date,
|
||
ifNull(title, '') AS title,
|
||
uploader_id,
|
||
ifNull(uploader, '') AS uploader,
|
||
uploader_sub_count,
|
||
is_age_limit,
|
||
view_count,
|
||
like_count,
|
||
dislike_count,
|
||
is_crawlable,
|
||
has_subtitles,
|
||
is_ads_enabled,
|
||
is_comments_enabled,
|
||
ifNull(description, '') AS description,
|
||
rich_metadata,
|
||
super_titles,
|
||
ifNull(uploader_badges, '') AS uploader_badges,
|
||
ifNull(video_badges, '') AS video_badges
|
||
FROM s3(
|
||
'https://clickhouse-public-datasets.s3.amazonaws.com/youtube/original/files/*.zst',
|
||
'JSONLines'
|
||
)
|
||
```
|
||
|
||
Some comments about our `INSERT` command:
|
||
|
||
- The `parseDateTimeBestEffortUSOrZero` function is handy when the incoming date fields may not be in the proper format. If `fetch_date` does not get parsed properly, it will be set to `0`
|
||
- The `upload_date` column contains valid dates, but it also contains strings like "4 hours ago" - which is certainly not a valid date. We decided to store the original value in `upload_date_str` and attempt to parse it with `toDate(parseDateTimeBestEffortUSOrZero(upload_date::String))`. If the parsing fails we just get `0`
|
||
- We used `ifNull` to avoid getting `NULL` values in our table. If an incoming value is `NULL`, the `ifNull` function is setting the value to an empty string
|
||
|
||
4. Open a new tab in the SQL Console of ClickHouse Cloud (or a new `clickhouse-client` window) and watch the count increase. It will take a while to insert 4.56B rows, depending on your server resources. (Without any tweaking of settings, it takes about 4.5 hours.)
|
||
|
||
```sql
|
||
SELECT formatReadableQuantity(count())
|
||
FROM youtube
|
||
```
|
||
|
||
```response
|
||
┌─formatReadableQuantity(count())─┐
|
||
│ 4.56 billion │
|
||
└─────────────────────────────────┘
|
||
```
|
||
|
||
5. Once the data is inserted, go ahead and count the number of dislikes of your favorite videos or channels. Let's see how many videos were uploaded by ClickHouse:
|
||
|
||
```sql
|
||
SELECT count()
|
||
FROM youtube
|
||
WHERE uploader = 'ClickHouse';
|
||
```
|
||
|
||
```response
|
||
┌─count()─┐
|
||
│ 84 │
|
||
└─────────┘
|
||
|
||
1 row in set. Elapsed: 0.570 sec. Processed 237.57 thousand rows, 5.77 MB (416.54 thousand rows/s., 10.12 MB/s.)
|
||
```
|
||
|
||
:::note
|
||
The query above runs so quickly because we chose `uploader` as the first column of the primary key - so it only had to process 237k rows.
|
||
:::
|
||
|
||
6. Let's look and likes and dislikes of ClickHouse videos:
|
||
|
||
```sql
|
||
SELECT
|
||
title,
|
||
like_count,
|
||
dislike_count
|
||
FROM youtube
|
||
WHERE uploader = 'ClickHouse'
|
||
ORDER BY dislike_count DESC;
|
||
```
|
||
|
||
The response looks like:
|
||
|
||
```response
|
||
┌─title────────────────────────────────────────────────────────────────────────────────────────────────┬─like_count─┬─dislike_count─┐
|
||
│ ClickHouse v21.11 Release Webinar │ 52 │ 3 │
|
||
│ ClickHouse Introduction │ 97 │ 3 │
|
||
│ Casa Modelo Algarve │ 180 │ 3 │
|
||
│ Профайлер запросов: трудный путь │ 33 │ 3 │
|
||
│ ClickHouse в Курсометре │ 4 │ 2 │
|
||
│ 10 Good Reasons to Use ClickHouse │ 27 │ 2 │
|
||
...
|
||
|
||
84 rows in set. Elapsed: 0.013 sec. Processed 155.65 thousand rows, 16.94 MB (11.96 million rows/s., 1.30 GB/s.)
|
||
```
|
||
|
||
7. Here is a search for videos with **ClickHouse** in the `title` or `description` fields:
|
||
|
||
```sql
|
||
SELECT
|
||
view_count,
|
||
like_count,
|
||
dislike_count,
|
||
concat('https://youtu.be/', id) AS url,
|
||
title
|
||
FROM youtube
|
||
WHERE (title ILIKE '%ClickHouse%') OR (description ILIKE '%ClickHouse%')
|
||
ORDER BY
|
||
like_count DESC,
|
||
view_count DESC;
|
||
```
|
||
|
||
This query has to process every row, and also parse through two columns of strings. Even then, we get decent performance at 4.15M rows/second:
|
||
|
||
```response
|
||
1174 rows in set. Elapsed: 1099.368 sec. Processed 4.56 billion rows, 1.98 TB (4.15 million rows/s., 1.80 GB/s.)
|
||
```
|
||
|
||
The results look like:
|
||
|
||
```response
|
||
┌─view_count─┬─like_count─┬─dislike_count─┬─url──────────────────────────┬─title──────────────────────────────────────────────────────────────────────────────────────────────────┐
|
||
│ 1919 │ 63 │ 1 │ https://youtu.be/b9MeoOtAivQ │ ClickHouse v21.10 Release Webinar │
|
||
│ 8710 │ 62 │ 4 │ https://youtu.be/PeV1mC2z--M │ What is JDBC DriverManager? | JDBC │
|
||
│ 3534 │ 62 │ 1 │ https://youtu.be/8nWRhK9gw10 │ CLICKHOUSE - Arquitetura Modular │
|
||
```
|
||
|
||
## Questions
|
||
|
||
### If someone disables comments does it lower the chance someone will actually click like or dislike?
|
||
|
||
When commenting is disabled, are people more likely to like or dislike to express their feelings about a video?
|
||
|
||
```sql
|
||
SELECT
|
||
concat('< ', formatReadableQuantity(view_range)) AS views,
|
||
is_comments_enabled,
|
||
total_clicks / num_views AS prob_like_dislike
|
||
FROM
|
||
(
|
||
SELECT
|
||
is_comments_enabled,
|
||
power(10, CEILING(log10(view_count + 1))) AS view_range,
|
||
sum(like_count + dislike_count) AS total_clicks,
|
||
sum(view_count) AS num_views
|
||
FROM youtube
|
||
GROUP BY
|
||
view_range,
|
||
is_comments_enabled
|
||
) WHERE view_range > 1
|
||
ORDER BY
|
||
is_comments_enabled ASC,
|
||
num_views ASC;
|
||
```
|
||
|
||
```response
|
||
┌─views─────────────┬─is_comments_enabled─┬────prob_like_dislike─┐
|
||
│ < 10.00 │ false │ 0.08224180712685371 │
|
||
│ < 100.00 │ false │ 0.06346337759167248 │
|
||
│ < 1.00 thousand │ false │ 0.03201883652987105 │
|
||
│ < 10.00 thousand │ false │ 0.01716073540410903 │
|
||
│ < 10.00 billion │ false │ 0.004555639481829971 │
|
||
│ < 100.00 thousand │ false │ 0.01293351460515323 │
|
||
│ < 1.00 billion │ false │ 0.004761811192464957 │
|
||
│ < 1.00 million │ false │ 0.010472604018980551 │
|
||
│ < 10.00 million │ false │ 0.00788902538420125 │
|
||
│ < 100.00 million │ false │ 0.00579152804250582 │
|
||
│ < 10.00 │ true │ 0.09819517478134059 │
|
||
│ < 100.00 │ true │ 0.07403784478585775 │
|
||
│ < 1.00 thousand │ true │ 0.03846294910067627 │
|
||
│ < 10.00 billion │ true │ 0.005615217329358215 │
|
||
│ < 10.00 thousand │ true │ 0.02505881391701455 │
|
||
│ < 1.00 billion │ true │ 0.007434998802482997 │
|
||
│ < 100.00 thousand │ true │ 0.022694648130822004 │
|
||
│ < 100.00 million │ true │ 0.011761563746575625 │
|
||
│ < 1.00 million │ true │ 0.020776022304589435 │
|
||
│ < 10.00 million │ true │ 0.016917095718089584 │
|
||
└───────────────────┴─────────────────────┴──────────────────────┘
|
||
|
||
22 rows in set. Elapsed: 8.460 sec. Processed 4.56 billion rows, 77.48 GB (538.73 million rows/s., 9.16 GB/s.)
|
||
```
|
||
|
||
Enabling comments seems to be correlated with a higher rate of engagement.
|
||
|
||
|
||
### How does the number of videos change over time - notable events?
|
||
|
||
```sql
|
||
SELECT
|
||
toStartOfMonth(toDateTime(upload_date)) AS month,
|
||
uniq(uploader_id) AS uploaders,
|
||
count() as num_videos,
|
||
sum(view_count) as view_count
|
||
FROM youtube
|
||
GROUP BY month
|
||
ORDER BY month ASC;
|
||
```
|
||
|
||
```response
|
||
┌──────month─┬─uploaders─┬─num_videos─┬───view_count─┐
|
||
│ 2005-04-01 │ 5 │ 6 │ 213597737 │
|
||
│ 2005-05-01 │ 6 │ 9 │ 2944005 │
|
||
│ 2005-06-01 │ 165 │ 351 │ 18624981 │
|
||
│ 2005-07-01 │ 395 │ 1168 │ 94164872 │
|
||
│ 2005-08-01 │ 1171 │ 3128 │ 124540774 │
|
||
│ 2005-09-01 │ 2418 │ 5206 │ 475536249 │
|
||
│ 2005-10-01 │ 6750 │ 13747 │ 737593613 │
|
||
│ 2005-11-01 │ 13706 │ 28078 │ 1896116976 │
|
||
│ 2005-12-01 │ 24756 │ 49885 │ 2478418930 │
|
||
│ 2006-01-01 │ 49992 │ 100447 │ 4532656581 │
|
||
│ 2006-02-01 │ 67882 │ 138485 │ 5677516317 │
|
||
│ 2006-03-01 │ 103358 │ 212237 │ 8430301366 │
|
||
│ 2006-04-01 │ 114615 │ 234174 │ 9980760440 │
|
||
│ 2006-05-01 │ 152682 │ 332076 │ 14129117212 │
|
||
│ 2006-06-01 │ 193962 │ 429538 │ 17014143263 │
|
||
│ 2006-07-01 │ 234401 │ 530311 │ 18721143410 │
|
||
│ 2006-08-01 │ 281280 │ 614128 │ 20473502342 │
|
||
│ 2006-09-01 │ 312434 │ 679906 │ 23158422265 │
|
||
│ 2006-10-01 │ 404873 │ 897590 │ 27357846117 │
|
||
```
|
||
|
||
A spike of uploaders [around covid is noticeable](https://www.theverge.com/2020/3/27/21197642/youtube-with-me-style-videos-views-coronavirus-cook-workout-study-home-beauty).
|
||
|
||
|
||
### More subtitles over time and when
|
||
|
||
With advances in speech recognition, it’s easier than ever to create subtitles for video with youtube adding auto-captioning in late 2009 - was the jump then?
|
||
|
||
```sql
|
||
SELECT
|
||
toStartOfMonth(upload_date) AS month,
|
||
countIf(has_subtitles) / count() AS percent_subtitles,
|
||
percent_subtitles - any(percent_subtitles) OVER (
|
||
ORDER BY month ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
|
||
) AS previous
|
||
FROM youtube
|
||
GROUP BY month
|
||
ORDER BY month ASC;
|
||
```
|
||
|
||
```response
|
||
┌──────month─┬───percent_subtitles─┬────────────────previous─┐
|
||
│ 2015-01-01 │ 0.2652653881082824 │ 0.2652653881082824 │
|
||
│ 2015-02-01 │ 0.3147556050309162 │ 0.049490216922633834 │
|
||
│ 2015-03-01 │ 0.32460464492371877 │ 0.009849039892802558 │
|
||
│ 2015-04-01 │ 0.33471963051468445 │ 0.010114985590965686 │
|
||
│ 2015-05-01 │ 0.3168087575501062 │ -0.017910872964578273 │
|
||
│ 2015-06-01 │ 0.3162609788438222 │ -0.0005477787062839745 │
|
||
│ 2015-07-01 │ 0.31828767677518033 │ 0.0020266979313581235 │
|
||
│ 2015-08-01 │ 0.3045551564286859 │ -0.013732520346494415 │
|
||
│ 2015-09-01 │ 0.311221133995152 │ 0.006665977566466086 │
|
||
│ 2015-10-01 │ 0.30574870926812175 │ -0.005472424727030245 │
|
||
│ 2015-11-01 │ 0.31125409712077234 │ 0.0055053878526505895 │
|
||
│ 2015-12-01 │ 0.3190967954651779 │ 0.007842698344405541 │
|
||
│ 2016-01-01 │ 0.32636021432496176 │ 0.007263418859783877 │
|
||
|
||
```
|
||
|
||
The data results show a spike in 2009. Apparently at that, time YouTube was removing their community captions feature, which allowed you to upload captions for other people's video.
|
||
This prompted a very successful campaign to have creators add captions to their videos for hard of hearing and deaf viewers.
|
||
|
||
|
||
### Top uploaders over time
|
||
|
||
```sql
|
||
WITH uploaders AS
|
||
(
|
||
SELECT uploader
|
||
FROM youtube
|
||
GROUP BY uploader
|
||
ORDER BY sum(view_count) DESC
|
||
LIMIT 10
|
||
)
|
||
SELECT
|
||
month,
|
||
uploader,
|
||
sum(view_count) AS total_views,
|
||
avg(dislike_count / like_count) AS like_to_dislike_ratio
|
||
FROM youtube
|
||
WHERE uploader IN (uploaders)
|
||
GROUP BY
|
||
toStartOfMonth(upload_date) AS month,
|
||
uploader
|
||
ORDER BY
|
||
month ASC,
|
||
total_views DESC;
|
||
```
|
||
|
||
```response
|
||
┌──────month─┬─uploader───────────────────┬─total_views─┬─like_to_dislike_ratio─┐
|
||
│ 1970-01-01 │ T-Series │ 10957099 │ 0.022784656361208206 │
|
||
│ 1970-01-01 │ Ryan's World │ 0 │ 0.003035559410234172 │
|
||
│ 1970-01-01 │ SET India │ 0 │ nan │
|
||
│ 2006-09-01 │ Cocomelon - Nursery Rhymes │ 256406497 │ 0.7005566715978622 │
|
||
│ 2007-06-01 │ Cocomelon - Nursery Rhymes │ 33641320 │ 0.7088650914344298 │
|
||
│ 2008-02-01 │ WWE │ 43733469 │ 0.07198856488734842 │
|
||
│ 2008-03-01 │ WWE │ 16514541 │ 0.1230603715431997 │
|
||
│ 2008-04-01 │ WWE │ 5907295 │ 0.2089399470159618 │
|
||
│ 2008-05-01 │ WWE │ 7779627 │ 0.09101676560436774 │
|
||
│ 2008-06-01 │ WWE │ 7018780 │ 0.0974184753155297 │
|
||
│ 2008-07-01 │ WWE │ 4686447 │ 0.1263845422065158 │
|
||
│ 2008-08-01 │ WWE │ 4514312 │ 0.08384574274791441 │
|
||
│ 2008-09-01 │ WWE │ 3717092 │ 0.07872802579349912 │
|
||
```
|
||
|
||
### How do like ratio changes as views go up?
|
||
|
||
```sql
|
||
SELECT
|
||
concat('< ', formatReadableQuantity(view_range)) AS view_range,
|
||
is_comments_enabled,
|
||
round(like_ratio, 2) AS like_ratio
|
||
FROM
|
||
(
|
||
SELECT
|
||
power(10, CEILING(log10(view_count + 1))) as view_range,
|
||
is_comments_enabled,
|
||
avg(like_count / dislike_count) as like_ratio
|
||
FROM youtube WHERE dislike_count > 0
|
||
GROUP BY
|
||
view_range,
|
||
is_comments_enabled HAVING view_range > 1
|
||
ORDER BY
|
||
view_range ASC,
|
||
is_comments_enabled ASC
|
||
);
|
||
```
|
||
|
||
```response
|
||
┌─view_range────────┬─is_comments_enabled─┬─like_ratio─┐
|
||
│ < 10.00 │ false │ 0.66 │
|
||
│ < 10.00 │ true │ 0.66 │
|
||
│ < 100.00 │ false │ 3 │
|
||
│ < 100.00 │ true │ 3.95 │
|
||
│ < 1.00 thousand │ false │ 8.45 │
|
||
│ < 1.00 thousand │ true │ 13.07 │
|
||
│ < 10.00 thousand │ false │ 18.57 │
|
||
│ < 10.00 thousand │ true │ 30.92 │
|
||
│ < 100.00 thousand │ false │ 23.55 │
|
||
│ < 100.00 thousand │ true │ 42.13 │
|
||
│ < 1.00 million │ false │ 19.23 │
|
||
│ < 1.00 million │ true │ 37.86 │
|
||
│ < 10.00 million │ false │ 12.13 │
|
||
│ < 10.00 million │ true │ 30.72 │
|
||
│ < 100.00 million │ false │ 6.67 │
|
||
│ < 100.00 million │ true │ 23.32 │
|
||
│ < 1.00 billion │ false │ 3.08 │
|
||
│ < 1.00 billion │ true │ 20.69 │
|
||
│ < 10.00 billion │ false │ 1.77 │
|
||
│ < 10.00 billion │ true │ 19.5 │
|
||
└───────────────────┴─────────────────────┴────────────┘
|
||
```
|
||
|
||
### How are views distributed?
|
||
|
||
```sql
|
||
SELECT
|
||
labels AS percentile,
|
||
round(quantiles) AS views
|
||
FROM
|
||
(
|
||
SELECT
|
||
quantiles(0.999, 0.99, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)(view_count) AS quantiles,
|
||
['99.9th', '99th', '95th', '90th', '80th', '70th','60th', '50th', '40th', '30th', '20th', '10th'] AS labels
|
||
FROM youtube
|
||
)
|
||
ARRAY JOIN
|
||
quantiles,
|
||
labels;
|
||
```
|
||
|
||
```response
|
||
┌─percentile─┬───views─┐
|
||
│ 99.9th │ 1216624 │
|
||
│ 99th │ 143519 │
|
||
│ 95th │ 13542 │
|
||
│ 90th │ 4054 │
|
||
│ 80th │ 950 │
|
||
│ 70th │ 363 │
|
||
│ 60th │ 177 │
|
||
│ 50th │ 97 │
|
||
│ 40th │ 57 │
|
||
│ 30th │ 32 │
|
||
│ 20th │ 16 │
|
||
│ 10th │ 6 │
|
||
└────────────┴─────────┘
|
||
```
|