
We’re thrilled to launch a new tool today: a big interactive map of Reddit, showing how the biggest subreddits on the site are connected with each other. Mike is joined by iDPI’s very own Jasmine Mangat and Virginia Partridge for a riveting tell-all about RedditMap.Social.
You can visit the tool at RedditMap.Social, talk to other redditors about it at r/RedditMapDotSocial, see the code at GitHub, and read Ethan’s post explaining the whole thing. Enjoy!
Transcript
Mike Sugarman:
Hi everybody, welcome back to Reimagining the Internet. I am Mike Sugarman, and I am sitting in once again for Ethan Zuckerman. I am really excited about today’s episode. We’re going to be talking about the next big project coming out of the lab, and it comes out today. It’s called RedditMap.social, and it’s basically what it sounds like. It’s a big map of the biggest subreddits on Reddit. Let’s see how many times I can say “map” and “Reddit” in this episode.
I am joined by two of the main people who worked on this project at the lab.
First, I have Virginia Partridge. She’s from the Center for Data Science. She also works with us on Media Cloud and IHOP, the International Observatory of Hate Project. Virginia, thanks for coming on the show.
Virginia Partridge:
Hello.
Mike Sugarman:
And I also have Jasmine Mangat. Jasmine is a, I guess you would say, research assistant on this project. I would say more than a research assistant. Pretty crucial to making this look so beautiful and work so well. And she’s also an undergraduate, at least for a few more weeks. But Jasmine, thank you for joining us.
Jasmine Mangat:
Hi.
Mike Sugarman:
Can one of you tell our listeners what is the Reddit map?
Jasmine Mangat:
Sure. So the Reddit map is a tool that we developed that features two different visualizations. One is a tree map and one is a bubble map. A tree map helps view data in a hierarchical form while a bubble map helps view kind of the clusters within that data. So Virginia worked on an amazing project that helps cluster together subreddits on the platform Reddit based on user comments. And I helped build a tool that helps visualize these clusters. The tree map basically helps navigate the tool and all the different communities on Reddit based on topics that are human generated by our amazing online content curator, Rebecca Curran, to help people kind of navigate through the visualization. And this works alongside the bubble map, which helps view all the different clusters. And another cool thing about the tool is that it helps you view how these communities evolve over time.
Mike Sugarman:
Virginia, maybe you can fill us in on some specifics about this. What are these communities they can find on the map? How many are there? How big are these communities? I mean, it’s not every single community on Reddit, right?
Virginia Partridge:
No, so we took the top 10,000 most commented on subreddits for each month. So I think one of the unique things about our approach, which uses some methods which were developed before, specifically community2vec, which was a paper from, I think, 2015. But the unique part of our approach is that we look at the top 10,000 subreddits for each month, and we were able to validate that you can perform this method on just a month’s worth of comments, Reddit comments, and still get sort of human interpretable results that make sense to people who are familiar with Reddit and for people who are unfamiliar with Reddit, they can use it to explore Reddit, figure out what kinds of subreddits they might be interested in, find people who share their interests.
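For readers who want to see the selection step Virginia describes in concrete terms, here is a minimal sketch of picking the most commented-on subreddits for each month. It is an illustration only, not the project's code: the function name `top_subreddits_by_month`, the `(month, subreddit)` record format, and the toy data are all assumptions, and the example keeps the top 2 rather than the top 10,000.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: one (month, subreddit) pair per comment.
comments = [
    ("2021-04", "baseball"), ("2021-04", "baseball"), ("2021-04", "baseball"),
    ("2021-04", "crochet"), ("2021-04", "crochet"),
    ("2021-04", "sleep"),
    ("2021-05", "sleep"), ("2021-05", "sleep"),
    ("2021-05", "baseball"),
]

def top_subreddits_by_month(records, n):
    """Return, for each month, the n subreddits with the most comments."""
    counts = defaultdict(Counter)
    for month, subreddit in records:
        counts[month][subreddit] += 1
    return {
        month: [name for name, _ in counter.most_common(n)]
        for month, counter in counts.items()
    }

top = top_subreddits_by_month(comments, n=2)
print(top)
```

Because the ranking is redone for every month, a subreddit can appear in one month's list and drop out of the next.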
Mike Sugarman:
To kind of give listeners a sense of how you might use this tool, you might go to RedditMap.social and you’re going to see, like Jasmine explained, a big block that contains smaller blocks and a big cluster of bubbles. And you can kind of just click around and explore. I’m going to look at a prototype of this right now. I’m on RedditMap.social. I’m seeing the biggest block as hobbies and media interests. The next biggest block is porn, then local, then finance and economics, religion, that’s a lot smaller, and then education is there also. And then we have similar colors in this kind of right cluster of bubbles, which I guess represent these big blocks. Some of the bubbles are closer, some of them are farther from each other. I’ve heard this described in the lab as neighbors. Can someone explain what a near neighbor is and what it means for how Reddit is structured?
Virginia Partridge:
So the way this method works is you, for each user on Reddit, each Redditor, they’re going to comment on a bunch of different subreddits over the course of a month. And you’ll create a community embedding based on predicting what other subreddits a person comments on. So, say I comment on the Red Sox subreddit, what’s the probability that I’m going to also comment on a baseball, generic baseball subreddit. And if that sort of prediction task makes sense and a lot of people behave in a similar way, then those subreddits will be closer in this multi-dimensional space that gets created. So that’s what the nearest neighbor idea is based on, is like subreddits that people often comment on together are going to be closer together.
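The nearest neighbor idea can be sketched in a few lines of code. This is a deliberately simplified stand-in, not community2vec itself: instead of training an embedding with a prediction task, it represents each subreddit by the users who commented there and compares subreddits with cosine similarity. The user names, subreddit names, and function names below are hypothetical.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy data: one (user, subreddit) pair per comment in a month.
comments = [
    ("ann", "redsox"), ("ann", "baseball"),
    ("bob", "redsox"), ("bob", "baseball"),
    ("cat", "redsox"), ("cat", "baseball"),
    ("dan", "crochet"), ("dan", "knitting"),
    ("eve", "crochet"), ("eve", "knitting"), ("eve", "baseball"),
]

def subreddit_vectors(pairs):
    """Represent each subreddit as a {user: comment_count} vector."""
    vectors = defaultdict(lambda: defaultdict(int))
    for user, subreddit in pairs:
        vectors[subreddit][user] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse {user: count} vectors."""
    dot = sum(a[user] * b[user] for user in a if user in b)
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest_neighbors(vectors, target, k=1):
    """Return the k subreddits whose commenters overlap most with target's."""
    scores = [
        (cosine(vectors[target], vector), name)
        for name, vector in vectors.items()
        if name != target
    ]
    scores.sort(reverse=True)
    return [name for _, name in scores[:k]]

vectors = subreddit_vectors(comments)
print(nearest_neighbors(vectors, "redsox"))
print(nearest_neighbors(vectors, "crochet"))
```

In this toy data the people commenting on the Red Sox subreddit also comment on the baseball one, so those two end up as neighbors, which is the intuition behind the real embedding.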
Jasmine Mangat:
I think one other thing I want to emphasize with the nearest neighbor concept in relation to the visualization is that because each subreddit can be described by a multi-dimensional space, however, our human capabilities are not able to visualize that multi-dimensional space, we’re limited to a 2D visualization. And so in order to still make sense of nearest neighbors, we have a feature on the tool where once you get down to a certain level, the tool actually connects the nearest neighbors to a given cluster so that you can kind of see that just because two clusters might be next to each other on the bubble map, that doesn’t necessarily mean that they’re actually close to each other in terms of user comments.
Mike Sugarman:
That’s interesting. It sounds like you’re starting to describe string theory for data science. Yeah, so, okay, let’s kind of go through this website structure a little bit. So we just kind of looked at like what you see when you first go on to RedditMap.social. You see some blocks, you see some dots, you see some neighbors. Let’s say I go one layer deeper. Let’s say, again, I’m just on this page, I’m going to click, I’m going to click on finance and economics, including crypto. And that explodes into a series of subcategories. I’m seeing crypto, non-Ethereum, crypto mining and Ethereum, finance and startups, US centric wealth and personal finance, stock trading and crypto trading. Actually, you know what? I’m going to have more fun than this. I’m going to go to hobbies and interest because I know there’s good stuff in here. I’m going to go to offline hobbies and lifestyle. And then I know there is pretty good stuff under physical mental health wellness and relationships. And it just keeps going. And I can keep clicking through. Eventually I get to actual subreddits, right? If I go into physical and mental health, now I actually see there’s r/alcohol, r/herpes, r/zerocarb. A lot of these are health related. There’s sleep, some of them are medication related. There’s Prozac. Jasmine, Virginia, I’ve seen you spend a lot of time playing with this tool. What are some of the more interesting subreddits you found? And you know, if we’re talking about neighbors, what are some of the more interesting neighbor relationships that you found between subreddits using this tool?
Jasmine Mangat:
Can I pull up an email I sent Ethan yesterday?
Mike Sugarman:
Yeah, yeah, go for it. You’re hearing something exciting on the podcast, which is someone pulling up an email. We so rarely do this, but this is how the sausage is made at the lab. Yeah.
Jasmine Mangat:
So yesterday I was kind of just going through the tool procrastinating on the rest of my work just to find some interesting subreddits. And I think some of the favorite ones I found were We Want Plates, r/WeWantPlates, which is about people just sharing weird and interesting ways that they are served food at a restaurant, so maybe they have burgers in a bucket, or I think one of them was like a bunch of sauces on a mini Ferris wheel.
Mike Sugarman:
I’m on here, I’m seeing butter on a rock and it’s a pile of butter on a rock. I’m seeing marshmallows served on a log. Yeah, I understand why people might want plates. (laughing) Yeah, what have you found Virginia?
Virginia Partridge:
One of my favorite findings was during the process of writing the paper, we had people annotate different clusters and try to say if they made sense or not, and also figure out how stable they were, if some clusters appeared every month and some didn’t. And there was this kind of funny thing. This group of subreddits, they seemed really unrelated at first. It was like distant socializing, read with me, Reddit sessions, and both me and the other annotators, which are all the other people in the lab, didn’t kind of know what to make of this. We thought it was just incoherent, this weird thing showing up. But as we dug into it more, it turned out it was this feature that Reddit had rolled out during the time period we were studying, which was 2021, 2022. And it was a live stream feature called Reddit Public Access Network. And our method did just a really good job clustering all of these subreddits together, which were all just people live streaming different activities they were doing. And it was a really interesting discovery that none of us knew about this feature before. And so we were able to find out about it and it seems to have been this really dedicated community of users that were really active there that we were able to find out because of this. So that was a really cool one.
Mike Sugarman:
And I’ll just say if you, listener, use this tool at RedditMap.social and want to talk to us, want to talk to other people also exploring it about what you’re finding, we would love if you do that, especially if you’re a Redditor. We actually made a subreddit where you can do it. r/RedditMapDotSocial. “Dot” is spelled out, so that’s Reddit Map D-O-T Social. We’ll be on there sharing things that we find interesting. We hope you do also. Sorry, Jasmine and Virginia, I’m still distracted looking at this. On r/WeWantPlates, there is what I’m looking at, is sushi served on an LED screen that has a video playing on it. That’s kind of a plate. I see how it’s not a plate. Oh no, this is going to go off the rails really fast if I keep looking at this. Okay, I need to close Reddit. So I would be really curious to hear what the history of this is. Because Jasmine, I remember when iDPI, when we had our big launch meeting in September of 2021, you were there and you were talking about doing Reddit visualizations. So I’d love to hear some background about it. I’d love to hear some history. Where did this come from? Why’d you do it?
Virginia Partridge:
I think the first time I started working on this paper was when I had just started working with Media Cloud, and there was a paper about Reddit and studying political communities on Reddit by Waller and Anderson that was in Nature. And Ethan came to me and asked me if I could reproduce the method they used, if I could reproduce that study. So I started doing that, and we were able to do it fairly successfully. So that was the initial community2vec, community embedding, Reddit-like map.
Jasmine Mangat:
Yeah, so I came into iDPI working on a different project, but eventually I emailed Ethan asking if he has any more projects for the coming year and he mentioned the project that Virginia was working on, which at the time I’m not exactly sure what you were doing but I think it was something related to starting to cluster together, doing K-Means clustering on the community2vec work that she had done. And they’d been looking for someone who could make the work that Virginia is doing accessible to a wider audience. So that’s when we started discussing, seeing what kind of tool would be helpful to accomplish this.
Mike Sugarman:
And correct me if I’m wrong, this is an open source project, right? What does that exactly mean in this context? Because I think a lot of people would just look at this and say, hey, it’s a cool thing I can click around on. Yeah, why is this open source? What would people potentially do with this tool? If they wanted to build on it, iterate on it, something like that.
Jasmine Mangat:
Yeah, so I think one of the purposes of this tool is that there aren’t, first, there aren’t that many social media platforms that provide that much data about what’s going on. Reddit in the past has been one of those platforms. And another problem on top of that is that there aren’t as many tools to help maybe non-technical people explore what’s going on on these platforms. So one reason why we wanted to make this open source is that people have different needs as well. We’re trying our best to aim this towards social media researchers, journalists, activists, whoever is interested. However, if someone is using the tool and finding it useful, but there is another feature that they would like on it, we want to make it open source such that they can suggest that feature and hopefully any dedicated developers that are also interested in the project can help move that forward. And anyone who just wants to add another cool visualization is welcome to do so. And it’s a way that we can kind of share what’s going on on social media with one another without just limiting it to the companies or the people who have the kind of skills to do so.
Mike Sugarman:
Sure, and I think that’s really important, right? Because if this is something that’s going to help people visualize communities, I mean Reddit is a place where people really make the communities happen, right? You don’t have subreddits without the actual redditors. Yeah, so I love that you’re saying this isn’t just a one-way street, like, we’ve made this thing, look at it. We’re encouraging similar participation as what people already do there.
So Reddit, I think, has been a—well, we always look to it as a pretty positive social network. I mean, historically, it does have its problems, right? Like, Gamergate started on Reddit. There was r/TheDonald. I mean, just like every place online it’s had its serious issues, especially in the past several years. But things we like about Reddit are the kind of community participation aspect, volunteer moderation. Something that Reddit and until recently Twitter were really exemplars of was decent data access, which is nice. It’s nice for us because we like to study social media. It’s basically resulted in the research community really leaning on Reddit and Twitter to study social media because they actually make data available through APIs and all of that.
But I think, you know, a lot of people listening to this podcast probably don’t access Reddit data very often. They probably don’t do their own research on social media. It might just be kind of interesting to hear from you two, people who have a lot of experience with this, what data is available and what data were you able to use to make this tool to do this research?
Virginia Partridge:
Yeah, so Reddit, you’re right. Reddit has been historically very open about getting access to their data. They have their own data API, which you can access through like Python tools or just regular REST API requests. But yeah, that’s definitely only accessible to developers. And so one of our partners in the past has been Pushshift, which made basically bundles of data pulled from the Reddit API that you could just download. And that Pushshift data set has been cited by hundreds of papers. A lot of people have used it. Unfortunately, Pushshift and Reddit have disagreed about terms of the data access so that Pushshift data isn’t available at the moment. We’re waiting to see what’s going to happen there. And people have also raised concerns about Pushshift data, GDPR compliance, and things like that. So yeah, we need to see and watch what happens there. But the Reddit data API is still available to people. And definitely the data that you need to research Reddit will still be, we think, available in some way going forward.
Mike Sugarman:
Could one of you kind of give me a breakdown of how more everyday type Reddit users use data? I think there’s situations—I think moderators are—they need API access a decent amount. But maybe you could just kind of explain to our listeners what people do with that data besides things like build RedditMap.social?
Virginia Partridge:
Yeah, there’s like bots people use on Reddit all the time. So if you’re on, I’m like telling what my social media habits are, but I go on r/crochet a lot, where people post pictures of like crochet projects that they’ve made, and underneath it there’s a bot that will say, “Hey, can you give a link to your pattern if you used one? And describe what yarn you used to make this.” So people use the data access for things like those kinds of bots or other moderation enforcement policies.
Jasmine Mangat:
And I think in the past, people have pursued similar projects to Reddit Map. One project that we really enjoyed using as we built our tool was Map of Reddit. They also use the Reddit API to help cluster together subreddits. And currently, their project is open source as well with other developers working on categorizing the different topics covered by subreddits. And I think this goes into the fact that there is a lot of conversation within Reddit about the Reddit community as a whole, as the whole platform or certain sections to kind of see how different communities interact, the different trends that are going on, et cetera. So the API is also helpful for the internal community to just learn about the environment around them. And there is this question of if Reddit will continue to let that data be as open.
Mike Sugarman:
So to give some context, a few weeks ago, Elon Musk said that he was going to shut off the open Twitter API for our lab and the research that we do about Twitter. That means that in theory, Twitter wants to start charging us $44,000 a month for data that we previously had access to for free. And then Reddit made a similar move limiting their API access also, which has implications for Pushshift. I think Virginia, as you rightly pointed out, Pushshift has not always been so up on the GDPR compliance. So there’s arguably from Reddit’s point of view, a reason to be a little stricter with the data access.
But on one hand, this really seems like an existential threat to not just research, but kind of an open web. Being able to contribute to a social network and be able to get your data back from it seems like a pretty fundamental thing we should be able to do. On the flip side, I think with Reddit specifically, they’re aware that a lot of AI large language models are being trained on their freely available data. So I think Reddit says, hey, look, if Microsoft, Google, whoever else, these huge corporations are going to build AI technology trained on our data, and we’re not going to get even credit, that’s a problem. We should have some kind of licensing. We had a lab meeting yesterday where Ethan was talking about this, and he thought that was pretty fair. I think he made a good case for it. I think, unfortunately, it’s one of those things that seems a little bit bad from both sides. But well, one thing that we can do in the meantime is we can actually take a look at what the data that we have is saying, glean some trends from it. Virginia, I know you are actually presenting some findings from this long-running research you are doing at the International Conference on Web and Social Media, the ICWSM. And you wrote a really detailed paper that I had a hard time making a ton of sense of because I’m not a data scientist. I’m the media guy. But I would love it if you tell us a bit about, tell us about that paper. Tell us about what you were saying in that paper and what are some interesting things that people might want to know.
Virginia Partridge:
Sure. So we’re submitting it. It hasn’t been accepted yet, but we’re submitting it. And the main idea of that paper is to show that these community embedding, community2vec methodologies, which again, people have done before, but on large amounts of data, years’ worth of data. But actually, you only need about a month to get decent performance. So we validate this method in several different ways, and those are all laid out in the paper. And we also look at the temporal trends. So, like, how do neighbors change over time? How does the embedding space change over time? And we found, you know, it’s gradual and also can reveal certain trends that align with things we know about in the real world or even reveal new things. So like one example is that we found that the neighbors of subreddits about NFT marketplaces, so buying and selling NFTs, in the first part of our data, were changing a lot. They weren’t very stable. But then as NFTs became more popular in 2022, they became more stable. So this NFT community emerged in 2022, and we could see that. So that is the main idea of this paper, is yeah, changes over time. You can track at the monthly level, and they make sense to people. And we really think you can trust the output and use it.
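One way to picture what it means for a subreddit's neighbors to become more stable is to compare its neighbor list from one month to the next. The sketch below uses Jaccard overlap, which is an assumption chosen for illustration and may not be the measure used in the paper. The subreddit names, months, and neighbor lists are invented, not results from the study.

```python
# Hypothetical neighbor lists per month for one made-up subreddit.
neighbors_by_month = {
    "2021-11": {"nft_market": ["art", "crypto", "gaming"]},
    "2021-12": {"nft_market": ["memes", "stocks", "crypto"]},
    "2022-01": {"nft_market": ["nft", "crypto", "ethereum"]},
    "2022-02": {"nft_market": ["nft", "crypto", "ethereum"]},
}

def jaccard(a, b):
    """Share of neighbors two lists have in common (1.0 means identical sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def neighbor_stability(by_month, subreddit):
    """Compare each month's neighbors with the previous month's."""
    months = sorted(by_month)
    return [
        (later, jaccard(by_month[earlier].get(subreddit, []),
                        by_month[later].get(subreddit, [])))
        for earlier, later in zip(months, months[1:])
    ]

stability = neighbor_stability(neighbors_by_month, "nft_market")
for month, score in stability:
    print(month, score)
```

Low scores mean the neighbors are churning from month to month, and scores near 1.0 mean the neighborhood has settled, which is the kind of shift Virginia describes for the NFT subreddits.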
Mike Sugarman:
For people who are not super versed in the data science side of things, what’s the type of thing that’s really interesting about this work? And what’s the type of thing that kind of you had fun with as you were going through your findings and writing this up and thinking of how you want to present this to the world?
Virginia Partridge:
Yeah, I think the fun part of the paper for me was taking my background in natural language processing. I usually work with large text corpora, so not like user data. So the interesting thing for me was taking some of those methods I had known from natural language processing. So this community embedding thing comes from something called word embeddings. And learning how to apply those methods to a different type of data, that was really interesting to me. I was able to draw on work I had done on topic modeling. Similarly, how do you categorize documents? By what words appear in them, what topic they’re about. And try to translate that work to clusters of subreddits. So the two annotation methodologies that we used, which were like subreddit intruder and check-in coherence, those come from the topic modeling community. So I tried to be sort of interdisciplinary in that approach and that was really interesting for me, just like applying it to a new space.
Mike Sugarman:
Do either of you have any messages for people who might want to explore this project, who might want to know more about it, who might want to just play around.
Jasmine Mangat:
Don’t use it as a procrastination tool. I’m kidding.
Mike Sugarman:
So I hope they’ll do it, because then they’ll use it.
Jasmine Mangat:
Yeah. But I think when we were building the tool, we were thinking about how we can really make it exploratory. And I think Virginia also mentioned how even, even though the clusters that we have are computer generated, Rebecca did some amazing work in terms of labeling them and kind of demonstrating how while we might think that oh, these are objective findings from the computer, when we’re trying to understand them as a human, we often come up with very subjective ways to categorize. And so throughout, while using the tool you might find, oh, well, this subreddit shouldn’t belong to this category, I don’t understand why it’s there. But it’s important to think about how we categorize people’s interests. It’s often there’s more going on than what meets the eye. And it’s really easy to also make assumptions about it. So I think exploring and trying to connect what patterns you see with perhaps real life events or really going into these communities to try to understand what was going on at that time is really a way that we can go forward with kind of social media research and also fulfilling your own curiosities with online communities.
Mike Sugarman:
Virginia, Jasmine, thank you so much for joining us today. You did great work here. I think everyone’s going to be really excited to check out RedditMap.social. Again, it’s just RedditMap.social. If you want to talk about it on Reddit, it’s r/RedditMapDotSocial.