5 Main Takeaways from Randomly Sampling YouTube

We’re excited to announce the publication of our paper, Dialing for Videos: A Random Sample of YouTube, in the Journal of Quantitative Description: Digital Media. The article is the culmination of a long research project to better understand YouTube as a whole by producing a random sample of YouTube videos, analyzing their metadata, sending the audio files through a language detection pipeline, and hand-coding a subset of videos to better understand their content. 

While this blog post highlights five of what I think are the most interesting takeaways from the paper, anyone who studies YouTube or wants to better understand a site they use all the time should check out the full paper. It’s a long one, chock full of data, charts, and observations for those interested.

There are about 14 billion publicly visible videos (revised from 10 billion)

The method we used to generate a random sample allowed us to estimate the total size and growth of YouTube. Ten billion is an unfathomably large number, and it’s worth noting that we didn’t even count private videos. Further, our data is nearly a year old at this point.

Using a slightly different method analyzed in the paper, we can actually estimate that the total number of videos is probably more than 13 billion. Since this is information we know researchers will need on a more timely basis than is realistic in the world of academic publishing, we’ve set up a web-based dashboard to share this kind of data on a regular basis at tubestats.org.
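For readers curious how a random sample can yield a size estimate at all, here is a minimal sketch of the general idea. It is not the paper's actual procedure, which this post doesn't spell out. It assumes identifiers are drawn uniformly at random from an ID space of known size, and that the fraction of draws landing on a real video can be scaled up to the whole space. The function names and every number below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def estimate_total(hits: int, attempts: int, id_space: int) -> float:
    """Scale the observed hit rate up to the full ID space.

    Assumes each attempt is an independent, uniform draw from `id_space`
    possible identifiers (an assumption of this sketch).
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return hits * id_space / attempts


def standard_error(hits: int, attempts: int, id_space: int) -> float:
    """Binomial standard error of the hit rate, scaled to the ID space."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p = hits / attempts
    return math.sqrt(p * (1 - p) / attempts) * id_space


# Hypothetical toy numbers, not figures from the paper.
TOY_ID_SPACE = 1_000_000   # size of the made-up identifier space
TOY_ATTEMPTS = 50_000      # random identifiers tried
TOY_HITS = 25              # identifiers that turned out to exist

estimate = estimate_total(TOY_HITS, TOY_ATTEMPTS, TOY_ID_SPACE)
error = standard_error(TOY_HITS, TOY_ATTEMPTS, TOY_ID_SPACE)
print(f"estimated items: {estimate:.0f} (standard error about {error:.0f})")
```

The standard error shows why a low hit rate demands a very large number of attempts: with only a handful of hits, the uncertainty is a sizable fraction of the estimate itself.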

YouTube is mostly not in English

Contrary to the experience of most YouTube users in the United States or in other primarily English-speaking countries, YouTube is mostly non-English. Depending on the language detection software you use, only about 20-40% of videos are in English, which constitutes a plurality but also means that someone only paying attention to English YouTube is missing most of what’s there.

Our current best estimate is that 32% of videos where we can detect the language are in English, with 10.5% in Hindi, 8% in Spanish, slightly fewer in Portuguese, and just over 6% in Arabic.

Most of YouTube doesn’t get many views

Thanks in part to YouTube’s recommendation system, most of the videos we watch are vastly more popular than the average YouTube video. Odds are good that most of the videos you watch have more than 1,000 views, but those videos account for just 13% of our sample. A significant number (4.9%) don’t have any views at all.

There is a very long tail when it comes to popularity, and the top sixteen videos account for more than half of all views in our sample. The statistics are even more stark when it comes to comments and likes — most videos have no comments (72.6%) and no likes (88.7%) at all.
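If you want to run the same kind of check on your own list of view counts, the figures above boil down to three simple ratios: the share of videos with zero views, the share above some threshold, and the share of all views held by the top few videos. The sketch below uses a made-up list of ten view counts, not data from our sample, and the function name and dictionary keys are only illustrative.

```python
def view_stats(views: list[int], top_k: int, threshold: int = 1000) -> dict[str, float]:
    """Summarize how concentrated views are in a list of per-video counts."""
    if not views:
        raise ValueError("views must not be empty")
    n = len(views)
    total = sum(views)
    ranked = sorted(views, reverse=True)  # most-viewed videos first
    return {
        # fraction of videos nobody has watched
        "share_zero_views": sum(v == 0 for v in views) / n,
        # fraction of videos with more than `threshold` views
        "share_above_threshold": sum(v > threshold for v in views) / n,
        # fraction of all views that belong to the top_k videos
        "top_k_view_share": sum(ranked[:top_k]) / total if total else 0.0,
    }


# Hypothetical view counts, invented for illustration only.
toy_views = [0, 0, 3, 12, 40, 150, 900, 2_500, 80_000, 1_000_000]
stats = view_stats(toy_views, top_k=1)
for name, value in stats.items():
    print(f"{name}: {value:.1%}")
```

Even in this tiny invented list, a single video holds the overwhelming majority of views, which is the same long-tail shape described above.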

Not everyone is participating in the “creator economy”

People use YouTube for an incredibly wide variety of purposes. It is easy to get the impression that YouTube is dominated by apparently professional creators working within established genres, employing typical tropes and conventions to find the largest possible audience.

Taking a look at a random sample of videos makes clear that while these may account for a lot of the most popular videos, the vast unseen dark matter of YouTube is much more variable — and often strange.

There are short, effortless videos with nothing more than manipulated still photos, homework assignments and class lectures, barely audible three-second clips of ceremonies with no context, and even recorded Zoom meetings that could have no hope for an audience beyond someone who couldn’t attend live.

It’s there in the miscellany that you find YouTube the social network, YouTube the random video storage, and the many other YouTubes that aren’t pursuing a broad audience.

There are an awful lot of video games

It isn’t surprising that there are a lot of video games on YouTube, but we were not expecting it to be the largest category by a wide margin. To look at the categories uploaders choose to put their videos in, you would think that “gaming” constituted a sizable portion of YouTube: a bit under 13% of videos in our sample were in that category — the second most popular behind the default, “people and blogs,” which had almost 56%.

But when we hand-coded a subsample of videos, we found nearly 30% to be about video games, almost 9% higher than people and blogs.

Some of these were clips of video games with no apparent narration, some were “Let’s Play” style video games with commentary by the player, a few were people talking about games rather than playing them, and many were just friends playing a multiplayer game with microphones turned on. Other topics represented a smaller part of our sample than games but were nonetheless surprising, like religious content, which made up about 3% of the videos we coded by hand!

Check out the full paper

I’d encourage you to check out the full paper in the open access Journal of Quantitative Description: Digital Media for all the data and insights from our random sampling adventure.
