--- slug: /en/getting-started/example-datasets/youtube-dislikes sidebar_label: YouTube Dislikes description: A collection is dislikes of YouTube videos. --- # YouTube dataset of dislikes In November of 2021, YouTube removed the public ***dislike*** count from all of its videos. While creators can still see the number of dislikes, viewers can only see how many ***likes*** a video has received. :::important The dataset has over 4.55 billion records, so be careful just copying-and-pasting the commands below unless your resources can handle that type of volume. The commands below were executed on a **Production** instance of [ClickHouse Cloud](https://clickhouse.cloud). ::: The data is in a JSON format and can be downloaded from [archive.org](https://archive.org/download/dislikes_youtube_2021_12_video_json_files). We have made this same data available in S3 so that it can be downloaded more efficiently into a ClickHouse Cloud instance. Here are the steps to create a table in ClickHouse Cloud and insert the data. :::note The steps below will easily work on a local install of ClickHouse too. The only change would be to use the `s3` function instead of `s3cluster` (unless you have a cluster configured - in which case change `default` to the name of your cluster). ::: ## Step-by-step instructions 1. Let's see what the data looks like. The `s3cluster` table function returns a table, so we can `DESCRIBE` the reult: ```sql DESCRIBE s3Cluster( 'default', 'https://clickhouse-public-datasets.s3.amazonaws.com/youtube/original/files/*.zst', 'JSONLines' ); ``` ClickHouse infers the following schema from the JSON file: ```response ┌─name────────────────┬─type─────────────────────────────────┐ │ id │ Nullable(String) │ │ fetch_date │ Nullable(Int64) │ │ upload_date │ Nullable(String) │ │ title │ Nullable(String) │ │ uploader_id │ Nullable(String) │ │ uploader │ Nullable(String) │ │ uploader_sub_count │ Nullable(Int64) │ │ is_age_limit │ Nullable(Bool) │ │ view_count │ Nullable(Int64) │ │ like_count │ Nullable(Int64) │ │ dislike_count │ Nullable(Int64) │ │ is_crawlable │ Nullable(Bool) │ │ is_live_content │ Nullable(Bool) │ │ has_subtitles │ Nullable(Bool) │ │ is_ads_enabled │ Nullable(Bool) │ │ is_comments_enabled │ Nullable(Bool) │ │ description │ Nullable(String) │ │ rich_metadata │ Array(Map(String, Nullable(String))) │ │ super_titles │ Array(Map(String, Nullable(String))) │ │ uploader_badges │ Nullable(String) │ │ video_badges │ Nullable(String) │ └─────────────────────┴──────────────────────────────────────┘ ``` 2. Based on the inferred schema, we cleaned up the data types and added a primary key. Define the following table: ```sql CREATE TABLE youtube ( `id` String, `fetch_date` DateTime, `upload_date_str` String, `upload_date` Date, `title` String, `uploader_id` String, `uploader` String, `uploader_sub_count` Int64, `is_age_limit` Bool, `view_count` Int64, `like_count` Int64, `dislike_count` Int64, `is_crawlable` Bool, `has_subtitles` Bool, `is_ads_enabled` Bool, `is_comments_enabled` Bool, `description` String, `rich_metadata` Array(Map(String, String)), `super_titles` Array(Map(String, String)), `uploader_badges` String, `video_badges` String ) ENGINE = MergeTree ORDER BY (uploader, upload_date); ``` 3. The following command streams the records from the S3 files into the `youtube` table. :::important This inserts a lot of data - 4.65 billion rows. If you do not want the entire dataset, simply add a `LIMIT` clause with the desired number of rows. ::: ```sql INSERT INTO youtube SETTINGS input_format_null_as_default = 1 SELECT id, parseDateTimeBestEffortUSOrZero(toString(fetch_date)) AS fetch_date, upload_date AS upload_date_str, toDate(parseDateTimeBestEffortUSOrZero(upload_date::String)) AS upload_date, ifNull(title, '') AS title, uploader_id, ifNull(uploader, '') AS uploader, uploader_sub_count, is_age_limit, view_count, like_count, dislike_count, is_crawlable, has_subtitles, is_ads_enabled, is_comments_enabled, ifNull(description, '') AS description, rich_metadata, super_titles, ifNull(uploader_badges, '') AS uploader_badges, ifNull(video_badges, '') AS video_badges FROM s3Cluster( 'default', 'https://clickhouse-public-datasets.s3.amazonaws.com/youtube/original/files/*.zst', 'JSONLines' ) SETTINGS max_download_threads = 24, max_insert_threads = 64, max_insert_block_size = 100000000, min_insert_block_size_rows = 100000000, min_insert_block_size_bytes = 500000000; ``` Some comments about our `INSERT` command: - The `parseDateTimeBestEffortUSOrZero` function is handy when the incoming date fields may not be in the proper format. If `fetch_date` does not get parsed properly, it will be set to `0` - The `upload_date` column contains valid dates, but it also contains strings like "4 hours ago" - which is certainly not a valid date. We decided to store the original value in `upload_date_str` and attempt to parse it with `toDate(parseDateTimeBestEffortUSOrZero(upload_date::String))`. If the parsing fails we just get `0` - We used `ifNull` to avoid getting `NULL` values in our table. If an incoming value is `NULL`, the `ifNull` function is setting the value to an empty string - It takes a long time to download the data, so we added a `SETTINGS` clause to spread out the work over more threads while making sure the block sizes stayed fairly large 4. Open a new tab in the SQL Console of ClickHouse Cloud (or a new `clickhouse-client` window) and watch the count increase. It will take a while to insert 4.56B rows, depending on your server resources. (Without any tweaking of settings, it takes about 4.5 hours.) ```sql SELECT formatReadableQuantity(count()) FROM youtube ``` ```response ┌─formatReadableQuantity(count())─┐ │ 4.56 billion │ └─────────────────────────────────┘ ``` 5. Once the data is inserted, go ahead and count the number of dislikes of your favorite videos or channels. Let's see how many videos were uploaded by ClickHouse: ```sql SELECT count() FROM youtube WHERE uploader = 'ClickHouse'; ``` ```response ┌─count()─┐ │ 84 │ └─────────┘ 1 row in set. Elapsed: 0.570 sec. Processed 237.57 thousand rows, 5.77 MB (416.54 thousand rows/s., 10.12 MB/s.) ``` :::note The query above runs so quickly because we chose `uploader` as the first column of the primary key - so it only had to process 237k rows. ::: 6. Let's look and likes and dislikes of ClickHouse videos: ```sql SELECT title, like_count, dislike_count FROM youtube WHERE uploader = 'ClickHouse' ORDER BY dislike_count DESC; ``` The response looks like: ```response ┌─title────────────────────────────────────────────────────────────────────────────────────────────────┬─like_count─┬─dislike_count─┐ │ ClickHouse v21.11 Release Webinar │ 52 │ 3 │ │ ClickHouse Introduction │ 97 │ 3 │ │ Casa Modelo Algarve │ 180 │ 3 │ │ Профайлер запросов: трудный путь │ 33 │ 3 │ │ ClickHouse в Курсометре │ 4 │ 2 │ │ 10 Good Reasons to Use ClickHouse │ 27 │ 2 │ ... 84 rows in set. Elapsed: 0.013 sec. Processed 155.65 thousand rows, 16.94 MB (11.96 million rows/s., 1.30 GB/s.) ``` 7. Here is a search for videos with **ClickHouse** in the `title` or `description` fields: ```sql SELECT view_count, like_count, dislike_count, concat('https://youtu.be/', id) AS url, title FROM youtube WHERE (title ILIKE '%ClickHouse%') OR (description ILIKE '%ClickHouse%') ORDER BY like_count DESC, view_count DESC; ``` This query has to process every row, and also parse through two columns of strings. Even then, we get decent performance at 4.15M rows/second: ```response 1174 rows in set. Elapsed: 1099.368 sec. Processed 4.56 billion rows, 1.98 TB (4.15 million rows/s., 1.80 GB/s.) ``` The results look like: ```response ┌─view_count─┬─like_count─┬─dislike_count─┬─url──────────────────────────┬─title──────────────────────────────────────────────────────────────────────────────────────────────────┐ │ 1919 │ 63 │ 1 │ https://youtu.be/b9MeoOtAivQ │ ClickHouse v21.10 Release Webinar │ │ 8710 │ 62 │ 4 │ https://youtu.be/PeV1mC2z--M │ What is JDBC DriverManager? | JDBC │ │ 3534 │ 62 │ 1 │ https://youtu.be/8nWRhK9gw10 │ CLICKHOUSE - Arquitetura Modular │ ``` ## Questions ### If someone disables comments does it lower the chance someone will actually click like or dislike? When commenting is disabled, are people more likely to like or dislike to express their feelings about a video? ```sql SELECT concat('< ', formatReadableQuantity(view_range)) AS views, is_comments_enabled, total_clicks / num_views AS prob_like_dislike FROM ( SELECT is_comments_enabled, power(10, CEILING(log10(view_count + 1))) AS view_range, sum(like_count + dislike_count) AS total_clicks, sum(view_count) AS num_views FROM youtube GROUP BY view_range, is_comments_enabled ) WHERE view_range > 1 ORDER BY is_comments_enabled ASC, num_views ASC; ``` ```response ┌─views─────────────┬─is_comments_enabled─┬────prob_like_dislike─┐ │ < 10.00 │ false │ 0.08224180712685371 │ │ < 100.00 │ false │ 0.06346337759167248 │ │ < 1.00 thousand │ false │ 0.03201883652987105 │ │ < 10.00 thousand │ false │ 0.01716073540410903 │ │ < 10.00 billion │ false │ 0.004555639481829971 │ │ < 100.00 thousand │ false │ 0.01293351460515323 │ │ < 1.00 billion │ false │ 0.004761811192464957 │ │ < 1.00 million │ false │ 0.010472604018980551 │ │ < 10.00 million │ false │ 0.00788902538420125 │ │ < 100.00 million │ false │ 0.00579152804250582 │ │ < 10.00 │ true │ 0.09819517478134059 │ │ < 100.00 │ true │ 0.07403784478585775 │ │ < 1.00 thousand │ true │ 0.03846294910067627 │ │ < 10.00 billion │ true │ 0.005615217329358215 │ │ < 10.00 thousand │ true │ 0.02505881391701455 │ │ < 1.00 billion │ true │ 0.007434998802482997 │ │ < 100.00 thousand │ true │ 0.022694648130822004 │ │ < 100.00 million │ true │ 0.011761563746575625 │ │ < 1.00 million │ true │ 0.020776022304589435 │ │ < 10.00 million │ true │ 0.016917095718089584 │ └───────────────────┴─────────────────────┴──────────────────────┘ 22 rows in set. Elapsed: 8.460 sec. Processed 4.56 billion rows, 77.48 GB (538.73 million rows/s., 9.16 GB/s.) ``` Enabling comments seems to be correlated with a higher rate of engagement. ### How does the number of videos change over time - notable events? ```sql SELECT toStartOfMonth(toDateTime(upload_date)) AS month, uniq(uploader_id) AS uploaders, count() as num_videos, sum(view_count) as view_count FROM youtube GROUP BY month ORDER BY month ASC; ``` ```response ┌──────month─┬─uploaders─┬─num_videos─┬───view_count─┐ │ 2005-04-01 │ 5 │ 6 │ 213597737 │ │ 2005-05-01 │ 6 │ 9 │ 2944005 │ │ 2005-06-01 │ 165 │ 351 │ 18624981 │ │ 2005-07-01 │ 395 │ 1168 │ 94164872 │ │ 2005-08-01 │ 1171 │ 3128 │ 124540774 │ │ 2005-09-01 │ 2418 │ 5206 │ 475536249 │ │ 2005-10-01 │ 6750 │ 13747 │ 737593613 │ │ 2005-11-01 │ 13706 │ 28078 │ 1896116976 │ │ 2005-12-01 │ 24756 │ 49885 │ 2478418930 │ │ 2006-01-01 │ 49992 │ 100447 │ 4532656581 │ │ 2006-02-01 │ 67882 │ 138485 │ 5677516317 │ │ 2006-03-01 │ 103358 │ 212237 │ 8430301366 │ │ 2006-04-01 │ 114615 │ 234174 │ 9980760440 │ │ 2006-05-01 │ 152682 │ 332076 │ 14129117212 │ │ 2006-06-01 │ 193962 │ 429538 │ 17014143263 │ │ 2006-07-01 │ 234401 │ 530311 │ 18721143410 │ │ 2006-08-01 │ 281280 │ 614128 │ 20473502342 │ │ 2006-09-01 │ 312434 │ 679906 │ 23158422265 │ │ 2006-10-01 │ 404873 │ 897590 │ 27357846117 │ ``` A spike of uploaders [around covid is noticeable](https://www.theverge.com/2020/3/27/21197642/youtube-with-me-style-videos-views-coronavirus-cook-workout-study-home-beauty). ### More subtitiles over time and when With advances in speech recognition, it’s easier than ever to create subtitles for video with youtube adding auto-captioning in late 2009 - was the jump then? ```sql SELECT toStartOfMonth(upload_date) AS month, countIf(has_subtitles) / count() AS percent_subtitles, percent_subtitles - any(percent_subtitles) OVER ( ORDER BY month ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING ) AS previous FROM youtube GROUP BY month ORDER BY month ASC; ``` ```response ┌──────month─┬───percent_subtitles─┬────────────────previous─┐ │ 2015-01-01 │ 0.2652653881082824 │ 0.2652653881082824 │ │ 2015-02-01 │ 0.3147556050309162 │ 0.049490216922633834 │ │ 2015-03-01 │ 0.32460464492371877 │ 0.009849039892802558 │ │ 2015-04-01 │ 0.33471963051468445 │ 0.010114985590965686 │ │ 2015-05-01 │ 0.3168087575501062 │ -0.017910872964578273 │ │ 2015-06-01 │ 0.3162609788438222 │ -0.0005477787062839745 │ │ 2015-07-01 │ 0.31828767677518033 │ 0.0020266979313581235 │ │ 2015-08-01 │ 0.3045551564286859 │ -0.013732520346494415 │ │ 2015-09-01 │ 0.311221133995152 │ 0.006665977566466086 │ │ 2015-10-01 │ 0.30574870926812175 │ -0.005472424727030245 │ │ 2015-11-01 │ 0.31125409712077234 │ 0.0055053878526505895 │ │ 2015-12-01 │ 0.3190967954651779 │ 0.007842698344405541 │ │ 2016-01-01 │ 0.32636021432496176 │ 0.007263418859783877 │ ``` The data results show a spike in 2009. Apparently at that, time YouTube was removing their community captions feature, which allowed you to upload captions for other people's video. This prompted a very successful campaign to have creators add captions to their videos for hard of hearing and deaf viewers. ### Top uploaders over time ```sql WITH uploaders AS ( SELECT uploader FROM youtube GROUP BY uploader ORDER BY sum(view_count) DESC LIMIT 10 ) SELECT month, uploader, sum(view_count) AS total_views, avg(dislike_count / like_count) AS like_to_dislike_ratio FROM youtube WHERE uploader IN (uploaders) GROUP BY toStartOfMonth(upload_date) AS month, uploader ORDER BY month ASC, total_views DESC; ``` ```response ┌──────month─┬─uploader───────────────────┬─total_views─┬─like_to_dislike_ratio─┐ │ 1970-01-01 │ T-Series │ 10957099 │ 0.022784656361208206 │ │ 1970-01-01 │ Ryan's World │ 0 │ 0.003035559410234172 │ │ 1970-01-01 │ SET India │ 0 │ nan │ │ 2006-09-01 │ Cocomelon - Nursery Rhymes │ 256406497 │ 0.7005566715978622 │ │ 2007-06-01 │ Cocomelon - Nursery Rhymes │ 33641320 │ 0.7088650914344298 │ │ 2008-02-01 │ WWE │ 43733469 │ 0.07198856488734842 │ │ 2008-03-01 │ WWE │ 16514541 │ 0.1230603715431997 │ │ 2008-04-01 │ WWE │ 5907295 │ 0.2089399470159618 │ │ 2008-05-01 │ WWE │ 7779627 │ 0.09101676560436774 │ │ 2008-06-01 │ WWE │ 7018780 │ 0.0974184753155297 │ │ 2008-07-01 │ WWE │ 4686447 │ 0.1263845422065158 │ │ 2008-08-01 │ WWE │ 4514312 │ 0.08384574274791441 │ │ 2008-09-01 │ WWE │ 3717092 │ 0.07872802579349912 │ ``` ### How do like ratio changes as views go up? ```sql SELECT concat('< ', formatReadableQuantity(view_range)) AS view_range, is_comments_enabled, round(like_ratio, 2) AS like_ratio FROM ( SELECT power(10, CEILING(log10(view_count + 1))) as view_range, is_comments_enabled, avg(like_count / dislike_count) as like_ratio FROM youtube WHERE dislike_count > 0 GROUP BY view_range, is_comments_enabled HAVING view_range > 1 ORDER BY view_range ASC, is_comments_enabled ASC ); ``` ```response ┌─view_range────────┬─is_comments_enabled─┬─like_ratio─┐ │ < 10.00 │ false │ 0.66 │ │ < 10.00 │ true │ 0.66 │ │ < 100.00 │ false │ 3 │ │ < 100.00 │ true │ 3.95 │ │ < 1.00 thousand │ false │ 8.45 │ │ < 1.00 thousand │ true │ 13.07 │ │ < 10.00 thousand │ false │ 18.57 │ │ < 10.00 thousand │ true │ 30.92 │ │ < 100.00 thousand │ false │ 23.55 │ │ < 100.00 thousand │ true │ 42.13 │ │ < 1.00 million │ false │ 19.23 │ │ < 1.00 million │ true │ 37.86 │ │ < 10.00 million │ false │ 12.13 │ │ < 10.00 million │ true │ 30.72 │ │ < 100.00 million │ false │ 6.67 │ │ < 100.00 million │ true │ 23.32 │ │ < 1.00 billion │ false │ 3.08 │ │ < 1.00 billion │ true │ 20.69 │ │ < 10.00 billion │ false │ 1.77 │ │ < 10.00 billion │ true │ 19.5 │ └───────────────────┴─────────────────────┴────────────┘ ``` ### How are views distributed? ```sql SELECT labels AS percentile, round(quantiles) AS views FROM ( SELECT quantiles(0.999, 0.99, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)(view_count) AS quantiles, ['99.9th', '99th', '95th', '90th', '80th', '70th','60th', '50th', '40th', '30th', '20th', '10th'] AS labels FROM youtube ) ARRAY JOIN quantiles, labels; ``` ```response ┌─percentile─┬───views─┐ │ 99.9th │ 1216624 │ │ 99th │ 143519 │ │ 95th │ 13542 │ │ 90th │ 4054 │ │ 80th │ 950 │ │ 70th │ 363 │ │ 60th │ 177 │ │ 50th │ 97 │ │ 40th │ 57 │ │ 30th │ 32 │ │ 20th │ 16 │ │ 10th │ 6 │ └────────────┴─────────┘ ```