Ryan McGrady and Kevin Zheng

97. There are 14 billion videos on YouTube. Mr. Beast, we hereby challenge you to watch them all.

Reimagining the Internet

Our lab’s Ryan McGrady and Kevin Zheng are taking a victory lap around some amazing work they’re doing here at the lab. Ryan just published an article in The Atlantic about the research he’s leading to understand how big YouTube is and what exactly is on it, and Kevin recently debuted his amazing tool to visualize that data, TubeStats. This week on Reimagining: the deep dive.

Transcript

Mike Sugerman:

Hey everybody, welcome back to Reimagining the Internet. I am your producer and your host today, Mike Sugerman. I’m joined by two of my colleagues here at the lab. We have Kevin Zheng and Ryan McGrady. Hello, Kevin and Ryan. Nice to have you. 

Ryan, you just published a great piece in The Atlantic last week called What We Discovered on Deep YouTube: The video site isn’t just a platform. It’s infrastructure. So that’s The Atlantic version of an article that you published along with Kevin and Rebecca Curran and Jason Baumgartner and Ethan called Dialing for Videos: A Random Sample of YouTube. That’s in the Journal of Quantitative Description: Digital Media for anyone who wants to read it.

And then, Kevin, you just launched a really fantastic tool here at the lab called TubeStats. You can find that at Tubestats.org, which is effectively a dashboard that visualizes the data that we’re pulling in about YouTube to do this research. So Ryan, Kevin, welcome. Thank you for joining us. 

Ryan McGrady: 

Good to be here, Mike. 

Kevin Zheng: 

Thanks for having us. 

Mike Sugerman:

This is a really ambitious project where you all manage to actually calculate roughly how many videos are on YouTube. Prior to this, people didn’t really know. Ryan, you mentioned in The Atlantic article that it’s like, is it a few hundred million? Is it a billion? It’s 14 billion. 

And then you’ve also found a way to actually random sample those videos, learn something about them, do some data analysis. And we’ll talk about that. But maybe a good place to start is, why did you do it? 

Ryan McGrady:

So hi Mike. We started this project as an extension of another project where we wanted to understand harmful speech on YouTube. And as we started to try to understand that subject, we realized that there was a lot that was missing about research on YouTube in general. So there are lots of these small studies about little pieces of YouTube that have no way to contextualize the information they provide because there’s no baseline information about YouTube, information like how many videos are on YouTube and how many views those videos get. And so that seriously limits what researchers can do with YouTube.

And indeed, YouTube has become kind of difficult to study, as have many big platforms. So this project was an attempt to try to be able to make claims about YouTube as a whole, both for its own sake, to understand a tool that we use every single day all over the world, and also as a way to provide more context to other studies about YouTube. That’s where it started.

Kevin Zheng:

And what makes it especially difficult is that it’s all video, whereas other social media sites have mostly text-based posts. Most of YouTube’s data is in video and in audio. So if you were to have a large data set of videos, how would you even begin to process all of that? So those are some of the questions that led us down the path of doing this research now.

Mike Sugerman:

And maybe you could give some background to listeners about what it was like doing this research because some people might look at this and say, shouldn’t YouTube just make all of this data available? Like, why don’t they tell you how many videos are on YouTube? Why don’t they give a breakdown of the languages that are represented? Shouldn’t this be data they offered through their API? The truth is that it’s not that simple. In fact, it was quite difficult to study YouTube. Why is it so hard to study YouTube?

Ryan McGrady:

So that’s a great point, Mike. The idea for The Atlantic article, which really centers on the language that we use to talk about platforms or infrastructure like YouTube, the idea for that came from a shift in the way that I found myself talking about this project. First, we want to study harmful content. Before we do that, there’s all this baseline information, kind of what I just told you about the motivation. 

But as time went on, I found myself shifting to talk more about how surprisingly difficult this is to do in a rigorous way and how, frankly, frustrating and disappointing it is that it’s so hard that we can’t expect things like that from the default video arm of the Internet. If there is any digital platform infrastructure website that we should be able to expect basic information from, it is perhaps YouTube, but we can’t.

We, in fact, get shockingly little information from YouTube. They release vague statistics like hours viewed. So, for example, in a certain month, many billions of hours were viewed on YouTube. But it doesn’t say where those videos go, which videos those go to. How does that translate to actual people? How does it translate to actual videos? There’s just so little information that they actually release.

Meanwhile, it’s something that we have organized large parts of society around. There are a lot of people that depend on YouTube. And the idea that we have to work so hard to find out just how big it is, is frankly kind of shocking. So, that is what led me to start exploring the reasons, not just that it’s so hard, but that we don’t expect as much.

And my theory, and I’m not the first person to make this argument, but my theory is that it’s the language that we use to talk about it. That we talk about it as a platform, as one of many options, a place where you can choose to stand and share your thoughts and express yourselves when the reality is that for a lot of people, YouTube is the only realistic option. And because of the way that it’s been built up and its presence in our society, it’s really functioning more like infrastructure.

It’s something where if it were to disappear tomorrow, it would have profound impacts on many parts of society, of the world. I give some examples of that in The Atlantic article, but that’s really where that came from. It came from frustration, from exactly the things that you’re describing.

Kevin Zheng:

And the stats that YouTube does release, they’re pretty indicative of the site’s goals, which are to appeal to advertisers and to make money. Hours watched is a good metric for advertisers, but it’s not a good metric for how people actually are using YouTube. When we began this, we discovered that so many people are using YouTube for all these use cases that aren’t necessarily captured by the metrics that YouTube is sending out. We saw so many gaming videos. We saw so many recorded church sermons on YouTube. And those aren’t necessarily part of my YouTube experience and probably not most people’s YouTube experience, but those are YouTube use cases that I think we need to understand, that we need to recognize because people are using them and get value out of doing that. 

And as Ryan said YouTube serves as infrastructure for people to share videos and it’s become the de facto place to share videos with anyone in the world. So that’s a role that YouTube shouldn’t take lightly, that YouTube should value. And we want to make sure that those use cases are represented in research as well. 

Mike Sugerman:

So, just to kind of understand your argument here about why we would consider YouTube to be infrastructure, it’s because people use those videos to communicate with each other in some way, right? It’s really easy to imagine the situation where an immunocompromised person doesn’t feel comfortable attending church services, and they can still tune in to their local church on a Sunday, right? Or they can catch up on it later because maybe there’s not a live video of that thing. 

But compare it to other means that people use to communicate, right? Like mail or phones or something like that. That is different, I think, from what YouTube presents itself as. When you go on the front page of YouTube, could you describe to me how what you see there is different from how you think people probably use this thing?

Ryan McGrady:

Well, I think a lot of the difficulty in having the conversation over platform versus infrastructure has to do with a lot of the things that we do see on YouTube, that we see a lot of things that we can very easily describe as like frivolous and silly and say, how is that like fundamental to society in a way that these other things that we call infrastructure, things like electricity and gas, how is that possibly the same thing? And I think that the description of YouTube as a platform does make sense for a lot of people. 

The thing is, YouTube is used for so many purposes. So let me give you some examples. There are a lot of activists who film controversial videos. And there are several examples. I’m not going to give the examples here. Ethan has written about some of these before, but they are frequently targeted by activists, by perhaps governments, because of the activism that they’re doing. And if they were to host the videos on their own little website, it would very easily be able to be taken down either through some sort of filtering or through like a distributed denial of service attack where you have a lot of computers overloading one computer with different messages. It’s a way of censorship sometimes. 

On YouTube, you can’t DDoS, you can’t deny service to YouTube. It’s too huge. It’s too central to the Internet. You also aren’t going to be able to bully them as easily using legal means. They have a huge legal team. They have technical resources and human resources that an activist doesn’t have.

On the viewer side, I use this example in The Atlantic article, but there’s a big example right now of Russia. YouTube is really popular in Russia. It’s become one of the only places to find information about current affairs in Russia, current conflicts in Russia that are not filtered through a government perspective, that aren’t through government media. And there are a lot of theories about why. 

It’s a little unclear. Some say YouTube might just be too popular in Russia. Some say they might not have a sufficient replacement because it’s really built up, invested in infrastructure that they just don’t have a good competitor for that could take its place. So there are the uploader side, the video creator side uses that are really more like infrastructure. 

And then there are viewer side uses that are more like infrastructure in addition to the cute cat videos. It’s not all cute cat videos and Minecraft streaming. And even some of those, you could frame them as being part of infrastructure. There are people who would not have a job if not for what they stream on YouTube and put on YouTube, for example. But yes, there are lots of potential examples.

It’s tempting to reach for analogies like electricity in a lava lamp, but that’s a frivolous use of electricity. And so therefore we shouldn’t think of that. But it’s a fraught analogy to think of electricity. I wanted to be clear that I’m not saying that we should treat YouTube the same way that we treat utilities. I think that a lot of the time we’re trying to appropriate existing structures onto new ideas where that existing policy structure doesn’t quite match up. I’ve mainly been arguing to just start thinking about it and talking about it as infrastructure and let that shape our expectations of YouTube. 

Mike Sugerman:

Right. That’s what we were trying to say to prove to people that YouTube is more than the number of views that Mr. Beast gets on his most recent video. I don’t know if I’ll keep this, but let me phrase it in such a way that I can avoid this preamble. One thing that you are talking about, Ryan, does make me think about a kind of interesting design quirk of electricity grids from right when they were starting to hook up giant cities to electricity. The reason why Coney Island is where it is in New York is because it’s the end of the New York power grid.

You can never really turn a power grid all the way off. So what you would do in the off hours was send your electricity from the entire city directed towards Coney Island to keep that powered, to keep that lit up, and then it kept the grid in action for when people started to demand electricity from it again and then it would flow back through the city. The thing that seems like the amusement park at the edge of town could actually be a critical component of that infrastructure. It doesn’t mean that it’s not infrastructure. It doesn’t mean that it’s not involved with infrastructure, but we do tend to see Coney Island more and pay attention to it more than we do to all of the lights that people have on all around the city. 

It sounds like a similar thing that YouTube is this big hulking infrastructure, but really, even the idea that Mr. Beast is particularly representative, one reason that’s flawed is because Mr. Beast speaks English. But I think you actually found that there’s a lot less English on YouTube than there are other languages. Could you tell me a bit about what you’ve learned about the international kind of use of YouTube, the non-American use of YouTube, which is honestly what most people listening to this podcast are probably predominantly familiar with in the first place? 

Ryan McGrady:

Yeah, so it seems like a plurality of YouTube is English. It does seem like it’s the most spoken language on YouTube, but most of YouTube is not in English. To put that another way, for someone in the United States who primarily speaks English and is looking at their YouTube homepage, which is of course curated by the YouTube recommendation algorithms, most of YouTube isn’t even in the running to be featured in that. Never mind your watch history, never mind your interests: most of it is just not in English.

So YouTube knows that if you don’t speak other languages, if you’re in the US and you’ve watched only English speaking videos, you’re not going to be interested in a video posted in Hindi, for example. But Hindi is a huge part of YouTube. Yeah, there’s a lot of interesting linguistic dimensions of YouTube to explore from the starting point of how unrepresentative the curated version is, how tailored your experience is to the language that you speak, but also, you know, we’re curious about how different language communities use YouTube. Kevin set up the language detection systems. They could probably say a little more about the range of findings that we’ve had or the range of results that we’ve had. 

Kevin Zheng:

Yeah, well, when I started working on this project in my undergrad, we didn’t have access to OpenAI’s Whisper model at that point. I don’t think it had been released yet. So we used an off-the-shelf language identification model. It didn’t perform particularly well, but it was good at helping us figure out some general trends within YouTube, which are the results in our Dialing for Videos paper.

We found that English was still the most common language on YouTube, but not a majority of YouTube. But once OpenAI made Whisper, their language model, open source, we took that and started running our videos through their model. And we can say with pretty, pretty good confidence that around 30% of YouTube is in English, around 10% is in Hindi, and a little bit less in Spanish. So these are what we’re pretty excited about right now: looking at these slices of YouTube and looking at these language groups, how different language groups might have changed over time, how fast they’re growing, what they’re talking about within these language groups. So that’s something that we’re looking to explore a little bit more.

Ryan McGrady:

This really comes back to what we talked about earlier. Kevin mentioned earlier just how difficult it is working with video. And this is a big part of it. But if you think about the kinds of things that you normally see transcribed, you think of professional productions. It’s one thing to transcribe a professional, you know, CNN’s video on YouTube. It’s another to transcribe some impromptu video at a party in a country where you’re not sure what language is being spoken. 

Like you can make out little pieces of it. The transcription software doesn’t typically differentiate speakers. There’s no punctuation. There are a lot of limitations, even with what is, from what we’ve been able to tell, the state of the art in language transcription and even just language detection. That’s been one of the big challenges in working with YouTube, with video data, or specifically with audio data from YouTube: it’s really hard. There’s been some trial and error involved, and we have a lot more work to do to ensure that we are doing it in a methodologically rigorous way.

Kevin Zheng:

And ideally, we’ll be incorporating human review. At this point, we’re pretty much just trusting these models that we’re getting from, you know, publicly available places. But we do want to check to make sure that they’re actually doing what they’re saying they’re doing and identifying the languages that are actually present on YouTube. So at this point, we’re trusting, we’re trusting OpenAI’s model and how they’ve trained it. 

Mike Sugerman:

So maybe this is a good time to talk about those videos of an impromptu party, right? Most likely, unless that’s like a video where something really remarkable happened at that party, or something really funny happened at that party, or it’s a famous person who is in the midst of being unceremoniously canceled, that is probably a video that is not going to have a lot of views. 

In fact, the vast majority of videos on YouTube don’t have a lot of views, right? I think most of what we’re dealing with here are what we call low-view-count videos. Yet these are public videos, right? They’re not private videos. You can only study public videos for what you’re doing. There’s no way for you to like break into people’s accounts and randomly sample what they have set to private. But that is kind of a weird territory to be in, right? It’s like, yes, these are low view count, and they are public, but since they are low view count and public, they might not be like public, public. They might not be public in the way that people would be particularly happy if a clip from their party went viral on the Internet, even though it was publicly available on YouTube. 

So maybe this is a two-parter. One, can one of you please give me the exact number of what amazing percentage of videos have lower than what view count? And two, yeah, how do you handle this question of, like, how do you figure out as researchers what is reasonable to study, what’s reasonable to share to put your findings, and then what users of YouTube can reasonably expect as some version of privacy, even with stuff that they don’t have set to private?

Kevin Zheng:

Well, our most recent sample gave us a median view count of 40, which means that 50% of videos have fewer than 40 views, which is surprisingly low. If you were to sample my home screen, I would say the median view count would probably be like 10,000, 20,000. So what we come to expect out of YouTube is really just scratching the surface of all of the videos out on YouTube that aren’t super widely viewed. 

And what we’re most concerned about is, Ethan likes to bring up the Justine Sacco story of some little known person tweeting before getting on a flight to South Africa and losing her job by the end of the flight. And it was meant to be something innocuous but was taken the wrong way by the public. And it wasn’t meant to be viewed widely beyond her close circle of friends.

So that’s something that we’re being really cautious about with these low-view count videos that we’re encountering and are sampling. What we’ve landed on so far is developing our statistical models of YouTube. So we’re taking in all of this metadata and we’re creating percentile ranges for view counts, for likes, for video duration, for the upload year.

So, all of that is aggregated, anonymized, and then presented back to the public through a range of percentiles. So if you have a video and you want to see what percentile is my video in terms of view counts, you can go to TubeStats and do that calculation on your own. But we won’t give you all of our samples and the actual videos that we’ve sampled. 
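The percentile lookup Kevin describes can be illustrated with a minimal sketch. It assumes a small table of view-count percentiles; only the median of 40 comes from the conversation, and every other threshold, along with the name `percentile_bracket`, is hypothetical and chosen for illustration. It is not how TubeStats itself is implemented.

```python
from bisect import bisect_right

# Hypothetical percentile table: (percentile, view count at that percentile).
# Only the median of 40 is taken from the conversation; all other values
# are invented for illustration.
VIEW_COUNT_PERCENTILES = [
    (10, 0),
    (25, 5),
    (50, 40),
    (75, 400),
    (90, 3000),
    (99, 100000),
]


def percentile_bracket(view_count, table=VIEW_COUNT_PERCENTILES):
    """Return the highest listed percentile whose threshold is <= view_count.

    Returns 0 when the view count is below every listed threshold.
    """
    thresholds = [views for _, views in table]
    idx = bisect_right(thresholds, view_count)
    return table[idx - 1][0] if idx > 0 else 0
```

Under these assumed thresholds, a video with 12,000 views would land in the 90th-percentile bracket, while a video with exactly 40 views sits at the median. Because the lookup only needs aggregated thresholds, no individual sampled video has to be exposed.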

Ryan McGrady:

When you watch some of the videos in the random sample, a handful of them are popular videos and look like the kind of videos that you’re used to seeing. Some of them look like the kind of videos that you’re used to seeing, but it’s people who are trying to make it by emulating the popular creators; they’re following these kinds of established genres and conventions of popular YouTube and trying to join the creator economy. Some of them seem like they’re just using YouTube as an archive, to store a video that they just took. Sometimes it’s not even clear that they knew it was being uploaded. Like, it was just somebody who seems to be fumbling with buttons on their phone. There are some of those.

In some cases, it’s people that know they’re uploading to YouTube, but they have a specific audience in mind and that audience is extremely limited. So like, one of my favorite first examples that I saw was there was a condo board meeting that seemed to be uploaded because somebody in the condo couldn’t make it and so it’s uploaded to YouTube because that’s just where you upload stuff. It’s not going to get a lot of views, again, unless something goes wrong. And there are lots of those examples. There are homework assignments in classes, especially during the pandemic. There are lots of classes, lessons uploaded to YouTube. There are lots of different uses. 

So the fact that there are so many that seem to not want or not appreciate the potential audience of YouTube is why, as Kevin said, we’re trying to be cautious. We haven’t exactly established a policy on this, but we’re erring on the side of not sharing specific video data, even though it’s really hard because some of them are just so weird and great that we’d love to share them. Some of the, like, a couple of these homework assignments that kids have uploaded to YouTube are just delightful examples of strange uses of audio-video on the Internet. And while some of them might really appreciate lots of extra views, I think that we’re going to err on the side of not for just now.

Mike Sugerman:

I was part of the group of people who was “hand coding” some of these videos, which meant Ryan used to send me a big list of videos and I had to watch them and fill out a Google Form that was really tedious after the 50th video. But I found myself thinking about when I went to a Celtics game in Boston a few years ago, and somewhere during a commercial break, they were showing a video of their star player, Jayson Tatum, that someone found on YouTube from when he was in high school for a class project where he was showing how to tie a tie, and it’s an embarrassing video.

It’s embarrassing if someone digs up your class project from high school and shows it at your job. It’s actually probably only acceptable if you’re a public figure because you’re one of the best basketball players in the entire world. But that’s a really interesting line to draw. I think it does kind of show the kind of, like, delicate nature of this stuff, right? The questions of when are things public, when are people public figures, and do we all deserve to have that treatment? 

Ryan McGrady:

And if I may add one other example here, one of the really startling things that a lot of us noticed during that hand coding part of the project and for context, we had a sample that we generated that was about 10,000, and we took 1,000 of those and actually watched them to see what they are in order to get a sense of the content. And that’s a separate section of the paper. 

So one of the things that several of us noticed is that there are a lot of kids in the sample, kids that seem to be below the age in YouTube’s terms of service; you’re supposed to be 13. There are kids dancing, kids playing Minecraft, kids just, like, goofing around.

I mean, there was more than we were prepared for, and that, more than anything else, is what made me feel like we should really be treating this in a secure way. And also makes me think that that’s something that needs some further investigation because it wasn’t just, like, videos that were uploaded in the last week that will, you know, be taken down by YouTube as soon as they realize that the uploaders were underage. Some of these are very old videos, which makes me wonder if there is a larger issue of, you know, YouTube not realizing that there are a lot of kids on this platform?

Mike Sugerman:

You’ve mentioned it a little bit. But TubeStats, again, is at Tubestats.org and is a really cool tool that you built. You did this. You made it look really good. You’ve designed a really nice logo for it. And Kevin even designed some really good alternate logos that I’ve been trying to convince him not to scrap because I think it’s nice to have some good alternates. What is TubeStats? What does it do? And why is it going to be useful for more than just this project?

Kevin Zheng:

TubeStats is where we’re sharing data from our samples. We’re continuing to sample about once a month to keep everyone updated on what YouTube looks like. And what we’re offering is an estimate of how big YouTube is. And to get more, like, fine-grained, we also have how many videos are uploaded per year, since its inception back in 2005. So those help us track how fast YouTube is growing. We have data on all of the metadata fields that we can get from YouTube. So, we have view count and duration. And we also have our language distributions, which we’re analyzing on our servers and presenting back out through TubeStats. So we have all this data.

We’re hoping that down the line when people are doing their own research on YouTube, if they need to know how big YouTube is, come to TubeStats, see what the most up-to-date number is. Every month, it keeps growing. Our most recent sample puts us at 14.4 billion, which is up from 14.1 billion back in December. And up from 13 billion back in November. So every month it’s growing. We’re hoping to track that change over time. And as an archival tool, we think that TubeStats is going to be really useful. Let’s say 10 years down the line, when we want to look back at 2023 YouTube or 2024 YouTube, we’ll have those numbers. We’ll see the platform’s trends and statistics at any point in time. 

Ryan McGrady:

It’s worth saying that those numbers are way up from the paper, which was based on data collected in 2022, late 2022, which came to an estimate of about 10 billion. So that’s about 4 billion over the course of a year, a little more than a year. And it says something about the size of YouTube and also about the importance of the site that Kevin built to be able to have this up-to-date information. The tool that Kevin has put together is really spectacular and I think something that will be frequently cited moving forward as academics, as journalists, as policymakers need fundamental information about YouTube that you can’t rely on YouTube to provide. 
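The growth figures quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the estimates mentioned in the conversation (in billions of videos); the labels and the helper name `growth` are illustrative. It also assumes the estimates are directly comparable, although differences between successive samples may reflect sampling error as well as real growth.

```python
# Estimates quoted in the conversation, in billions of videos.
estimates = [
    ("late 2022 (paper)", 10.0),
    ("November", 13.0),
    ("December", 14.1),
    ("most recent sample", 14.4),
]


def growth(start, end):
    """Return (absolute change, percent change) between two estimates."""
    change = end - start
    return change, 100.0 * change / start


# Compare each estimate with the one that follows it.
for (label_a, a), (label_b, b) in zip(estimates, estimates[1:]):
    change, pct = growth(a, b)
    print(f"{label_a} -> {label_b}: +{change:.1f} billion ({pct:.1f}%)")

# Compare the paper's estimate with the most recent sample.
total_change, total_pct = growth(estimates[0][1], estimates[-1][1])
print(f"overall: +{total_change:.1f} billion ({total_pct:.1f}%)")
```

The overall change works out to roughly 4.4 billion videos, consistent with the "about 4 billion" figure given for a little more than a year.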

Kevin Zheng:

Can I say a hot take? 

Mike Sugerman:

Yes, of course. 

Kevin Zheng:

You can choose to keep this or not. But I think YouTube is perhaps hesitant about releasing data like this because it shows how unwieldy it all is. I think knowing that there are 14 billion videos on YouTube means that YouTube probably doesn’t have a handle on all 14.4 billion videos.

For an advertiser, that’s horrifying if YouTube doesn’t know what’s happening on their own platform that they are selling back to advertisers to advertise on. That’s a scary prospect to know that we don’t know. I know there’s so much we don’t know about YouTube. In terms of brand safety, in terms of brand confidence, that’s something that I think YouTube is maybe a little bit less comfortable with, with the public knowing it.

Ryan McGrady:

I would add on to that and suggest that regulators may be an even more dangerous audience to learn about the unwieldiness of YouTube and the extent to which so much of it is dark matter, because YouTube can do a good job of ensuring that certain videos are shown to people. Their algorithm drives, at least the last time they released this statistic, about 70% of all traffic, which is a huge percentage considering how many times you see people calling for you to subscribe and link to their videos. The amount that is unseen, and I just mentioned that there are a lot of apparently young kids on there that YouTube apparently doesn’t know about, that seems like the sort of thing that if we ever get a Congress that would like to do any sort of real legislation about our digital infrastructure, they would be interested in, let’s say.

Kevin Zheng:

I know other researchers like Evelyn Douek are interested in bringing YouTube to Capitol Hill to have them testify about what YouTube is, how it works, and how it might be regulated to make it safer for everyone and a better experience for everyone.

Mike Sugerman:

Yeah, I mean, I might be out of pocket here. This is really not my area of expertise in the least, but my kind of layperson-grade perspective is that if YouTube is a place where there are a lot of children uploading content below what should be the limit, and if mainly the thing that YouTube does as a market actor is sell ads, then to some extent it’s inevitable that children are involved in the sale of ads. That does sound to me a little bit akin to a child labor style issue, not to mention a much bigger child safety issue, not to mention a much bigger, bigger issue about, you’re right, Kevin, the implications: is Google, which runs Google AdSense and YouTube, really as good at selling ads as it needs all the people buying ads to believe that it is?

One thing that I will say is that a lot of in-video ads are not running on these kinds of random, low view count videos. So, I don’t think that you’re going to see a lot of ads for creatine nutritional supplements on videos that kids accidentally uploaded, but still, it’s murky. I wonder if either of you have any insights on what some policy implications of this research might be? One thing is maybe it actually is a pretty good Section 230 style argument that, “Wow, gosh, we as YouTube know that there are so many videos on this platform that we can’t even keep track of all of it. See, we really are just a communication vector. We’re not a publisher.”

But on the other hand, I think this does run afoul of some basic DSA concerns, Digital Services Act, the law passed recently by the European Union to regulate the Internet. Yeah, what might we expect in terms of a kind of regulatory future of YouTube given this kind of data that’s emerging?

Ryan McGrady:

What might be expected as far as what’s likely, I’m not sure that in the United States all that much is likely. Our legislators have given a lot of lip service to regulating large platforms and like to take the CEOs in and ask them lots of pointed questions. But they’re not so keen about passing laws about it. I find the DSA very interesting. I think that there is a lot that is still very unclear about how it will be implemented. And I think that we should really pay attention to see what we can take away.

I think that it does a couple of things that are very interesting. One is it provides some new vocabularies to use, whereas our conversations here are often centered around Section 230, which for context is a law which provides protection to, quote unquote, providers to make it so that they don’t have responsibility for what their users do on their websites. And there are a lot of implications there. There have been a lot of debates over whether we should get rid of that or keep it in place.

And I tend to think that the result or the best option is somewhere in between or rather just do a different set of legislative initiatives that don’t necessarily tie themselves to just yay or nay on Section 230 because there’s a lot to like about Section 230, even if there’s a lot to criticize as well. 

But we should one way or the other be able to expect certain basic kinds of transparency about the sites which play such an important role in our lives. Transparency: we should be able to have researchers be able to look at the data. They should be able to audit the recommendation algorithms. We should be able to have a robust API. A lot of people, Ethan among others, have written about the golden age of platform research perhaps being behind us because all of the APIs, whether it be Twitter, Reddit, or even YouTube, are shrinking rather than expanding even while those sites play an increased role in ordering society. So we should be heading away from that. We should be expecting more. And the DSA may provide some insight in its use of different vocabulary.

It also does have a lot in there that spreads regulatory responsibilities around. It doesn’t just create a government entity to do everything. It creates rules that allow researchers to do things. It means that it’s something other than industry self-regulation because as Twitter has shown and as others of these sites have shown, that only works until somebody just decides to stop.

Kevin Zheng:

I think more data access is always good, especially for researchers, whether it be in collaboration with the platforms or independent access to these sites. There’s some good work out there on Facebook that is directly in partnership with Facebook. Unfortunately, there’s only so much you can say to ensure that you continue to have access to that data. The same would probably be said if YouTube were to eventually have or if YouTube were to provide us with data directly from their systems; I think to ensure that we continue to have access to that data, we probably wouldn’t be allowed to say too many things that hurt their bottom line. 

So there are a lot of limitations with data access and we’re hoping that we can, through TubeStats, through our other project Reddit Map, continue to advocate for independent researchers and their access to social media data. 

Mike Sugerman:

And for anybody who wants to learn more about this stuff, in episode 16 of this very show, we talked to Elizabeth Hansen Shapiro about a paper that Ethan and I wrote with her around platform data access back in 2021. Yeah, it was published in 2021. And then we ran a few episodes last September. I think those are episodes 87 through 89 about those Facebook studies that Kevin referenced. We interviewed a few academics who had some insight into things they learned from those studies. Maybe episode 86, we interviewed Laura Edelson, someone who was herself involved in researching advertising on Facebook, and had her advertising research shut down. So yeah, plenty to catch up on in this show if you want to learn more about platform data access and that sort of stuff.

Kevin and Ryan, thank you for a really fantastic summary of this research. Thanks for explaining TubeStats. I think there’s more to come here. So as always be sure to continue listening to podcasts, but also a great way to find out about more TubeStats work and research like this would be signing up for our newsletter, which you can find at our homepage, publicinfrastructure.org. Kevin Zheng and Ryan McGrady, thank you so much for joining us.

