By Ethan Zuckerman
“Accents are just mouth fonts.” That brilliant observation is just one of the gems I found today on r/BrandNewSentence, an online community dedicated to collecting “sentences never before written, found in the wild”. Fans of these strange sentences also enjoy r/NatureIsMetal, which features images of animals being savage or brutal, and r/InstantKarma, where people behaving badly quickly receive their comeuppance.
I found r/BrandNewSentence and these other subreddits through RedditMap.social, a tool designed by Jasmine Mangat, Virginia Partridge, and Rebecca Curran of UMass Amherst, just released by the Initiative for Digital Public Infrastructure and Media Cloud, and supported by the Knight Foundation. RedditMap uses Pushshift, an archive of posts and comments made to Reddit, to examine co-commenting patterns. By counting which users left which comments, we can cluster subreddits together. The more users who leave comments on both communities, the nearer those communities are placed to each other on a map.
We are not the first to build a map of Reddit. Randal Olson and a team at Michigan State maintain Redditviz, which maps Reddit across 59 “meta-interest” communities. Open-source developer “Anvaka” maintains a beautiful Map of Reddit that turns the vast platform into two continents, packed with “countries” like “Animeland” and “Furry Nation”.
In learning from these inspiring projects, we hoped to do a few new things with our map. We wanted to build a map that changes over time, reflecting the ways the Reddit community grows and evolves. We wanted it to be a tool for both scholarly study and exploration: you should be able to use our tool to see how influential Reddit communities like r/WallStreetBets grew and changed as they gained in popularity. We wanted to build a map that represented both how machines group subreddits together and the sometimes different ways humans order topics. We wanted the map to be useful for navigation: the map will mutate over time, but you should generally be able to find the same sets of communities in the same parts of the map. And finally, you should be able to use the map as a guide, helping you explore and discover unfamiliar subreddits.
Virginia built code that processes over 200 million comments a month and offers different ways of clustering subreddits together. Initially, we tried clustering by the content of the comments. This seems like a rational way to go, but it’s actually quite unsuccessful. The same people might talk about both Pokemon Go and Animal Crossing, but the two games are so different that there’s little overlap in the language they use across those two discussions.
Instead, we focused on co-commenting: we decided that two subreddits are related if the same people often comment on both of them. This gets you hundreds of clusters, and leaves you with the problem of determining whether those clusters make sense to humans, or just to machines.
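To make the co-commenting idea concrete, here is a minimal Python sketch. It assumes comments arrive as (user, subreddit) pairs and scores each pair of subreddits by the Jaccard overlap of their commenter sets. The similarity measure, the function name, and the sample data are all assumptions for illustration; this is not the RedditMap pipeline itself.

```python
from collections import defaultdict
from itertools import combinations


def co_comment_similarity(comments):
    """Score subreddit pairs by the overlap of their commenters.

    `comments` is an iterable of (user, subreddit) pairs. The score is the
    Jaccard index: shared commenters divided by all distinct commenters.
    Jaccard is an assumed measure, not necessarily what RedditMap uses.
    """
    commenters = defaultdict(set)
    for user, subreddit in comments:
        commenters[subreddit].add(user)

    scores = {}
    for a, b in combinations(sorted(commenters), 2):
        shared = commenters[a] & commenters[b]
        if shared:  # pairs with no shared commenters are left out
            union = commenters[a] | commenters[b]
            scores[(a, b)] = len(shared) / len(union)
    return scores


# Hypothetical sample data: three users commenting across three subreddits.
sample = [
    ("ana", "r/pokemongo"), ("ana", "r/AnimalCrossing"),
    ("ben", "r/pokemongo"), ("ben", "r/AnimalCrossing"),
    ("cat", "r/pokemongo"), ("cat", "r/worldnews"),
]
scores = co_comment_similarity(sample)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A pairwise score table like this could then be handed to any standard clustering method; which one works best is exactly the kind of question the human review described next is meant to answer.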
So we had humans—including me!—spend many hours looking at clusters to determine if they were “coherent” (if you saw a cluster that mentioned Cricket, Mumbai and Bollywood, could you offer a label for it) and whether we could identify “intruders” in a category (which one of these is not like the other: Ottawa, Maple Leafs, Vancouver, Pokemon, CanadaPolitics?).
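The “intruder” test can be sketched in a few lines of Python. This version assumes clusters are stored as a dictionary of name lists and builds one question by mixing members of a target cluster with a single item drawn from a different cluster. The function name, the cluster labels, and the data layout are hypothetical; the subreddit names come from the example above.

```python
import random


def make_intruder_question(clusters, target, rng, size=4):
    """Build one intruder question for the cluster named `target`.

    Returns (options, intruder): `size` members of the target cluster plus
    one item from a different cluster, shuffled together.
    """
    members = rng.sample(clusters[target], size)
    other = rng.choice([name for name in clusters if name != target])
    intruder = rng.choice(clusters[other])
    options = members + [intruder]
    rng.shuffle(options)
    return options, intruder


# Hypothetical clusters built from the example in the text.
clusters = {
    "canada": ["Ottawa", "Maple Leafs", "Vancouver", "CanadaPolitics"],
    "games": ["Pokemon", "AnimalCrossing"],
}
rng = random.Random(7)  # fixed seed so the question is reproducible
options, intruder = make_intruder_question(clusters, "canada", rng)
print(options, "-> intruder:", intruder)
```

If human raters reliably pick out the intruder, that suggests the cluster hangs together for people and not just for the algorithm.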
Once we’d tuned the algorithms to perform these tasks well, we had another problem. How do you organize these categories into a navigable hierarchy? Algorithmic approaches didn’t perform very well, so we tried a much more powerful technology: we asked a librarian. Rebecca Curran is not just a librarian, but also an expert on geek subcommunities, and she helped cluster and name our categories into hierarchies that we think are reasonably easy to navigate.
Our map encourages you to navigate using the treemap on the left, which shows you the “human readable” hierarchy, while also seeing how algorithms cluster subreddits together in the bubble map on the right. Maps are always a compromise. The bubble map actually exists in 100-dimensional space: what you’re seeing is a 2D projection of that super-complex data set. Jasmine Mangat built this gorgeous visualization that combines these two ways of mapping into a single tool you can use to navigate Reddit and discover communities you might find interesting.
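To give a feel for what a 2D projection of 100-dimensional data involves, here is a minimal sketch in plain Python. It uses principal component analysis computed by power iteration as a stand-in; the text above does not say which projection method the map uses, so the technique, the function names, and the randomly generated sample vectors are all assumptions.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(rows):
    """Subtract the per-dimension mean so the data is centered at the origin."""
    dims = len(rows[0])
    means = [sum(row[d] for row in rows) / len(rows) for d in range(dims)]
    return [[row[d] - means[d] for d in range(dims)] for row in rows]


def top_direction(rows, rng, iterations=200):
    """Power iteration: estimate the direction of greatest variance."""
    dims = len(rows[0])
    v = [rng.gauss(0, 1) for _ in range(dims)]
    for _ in range(iterations):
        # Multiply v by X^T X without building the covariance matrix.
        weights = [dot(row, v) for row in rows]
        v = [sum(w * row[d] for w, row in zip(weights, rows)) for d in range(dims)]
        norm = dot(v, v) ** 0.5  # assumes the rows are not all identical
        v = [x / norm for x in v]
    return v


def project_to_2d(rows, seed=0):
    """Project rows onto their top two principal directions."""
    rng = random.Random(seed)
    rows = center(rows)
    first = top_direction(rows, rng)
    # Remove the first direction from the data, then find the next one.
    deflated = [[x - dot(row, first) * f for x, f in zip(row, first)] for row in rows]
    second = top_direction(deflated, rng)
    return [(dot(row, first), dot(row, second)) for row in rows]


# Hypothetical data: two groups of five "subreddits" in 100 dimensions.
data_rng = random.Random(42)
cluster_a = [[data_rng.gauss(0, 1) for _ in range(100)] for _ in range(5)]
cluster_b = [[data_rng.gauss(10, 1) for _ in range(100)] for _ in range(5)]
points = project_to_2d(cluster_a + cluster_b)
for x, y in points:
    print(round(x, 1), round(y, 1))
```

In this toy data the two groups land on opposite sides of the horizontal axis, but any 2D view discards most of the original 98 dimensions, which is why the map is a compromise.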
Why map Reddit?
We previously worked with the Knight First Amendment Institute to map the wide landscape of all social media online. So, why put all this time and effort into building a map of Reddit specifically?
Reddit is different from other social networks in some interesting ways
It’s organized by topic, not by existing relationships. Rather than following your friends and figuring out what they’re interested in, on Reddit you follow an interest and, perhaps, make new friends. It’s likely that information flows differently on Reddit than on social networks organized around offline friendship, which may give us lessons about mis/disinformation, propaganda, etc.
Reddit is moderated differently than other networks
Each subreddit has a team of moderators, volunteers from the community who are responsible for setting and enforcing the rules of conversation. Scholars like Nathan Matias of Cornell’s CATlab see lessons in Reddit’s moderation for understanding the governance of online spaces and the mobilization of online protest.
We can study (almost) all of Reddit
That’s actually something that’s very hard with most online platforms. When social scientists study a community on YouTube or Twitter, they usually search for a small set of posts that match a certain criterion: videos whose titles include “Trump”, tweets containing #BlackLivesMatter. But what percent of all YouTube videos mention Trump, and how representative of YouTube as a whole is that subset of videos? It’s very hard to know, a problem I’ve called “the denominator problem”: we can know how many videos or posts we have to study, but not what percent of the total platform they represent.
With Reddit, we can solve the denominator problem. Pushshift, a volunteer effort managed for years by programmer and researcher Jason Baumgartner, collected all posts and comments from Reddit and archived them for research purposes. Our tool draws from the Pushshift archives, choosing the 10,000 most popular subreddits each month (which represent about 80% of the total comments on Reddit for that month). We can then see which conversations represent what percentage of the total dialog on Reddit.
For instance, Reddit is more permissive about pornography than many social networks, which might lead people to conclude that Reddit is primarily a porn site. Not so, according to our analysis. About 10% of comments in our dataset were posted on pornographic subreddits. That’s fewer comments than on subreddits dedicated to local/regional conversations like r/Boston and vastly fewer than comments about online/nerd culture (about 40% of Reddit comments) or even gaming (about 14%).
It is incredibly important to understand how big or small a particular conversation is compared to the bulk of Reddit as a whole. Research on social networks overall would be more comprehensible if we had a better understanding of whether specific online communities represent mainstream or fringe interests.
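The arithmetic behind these percentages is simple once you have the denominator. Below is a minimal Python sketch, assuming monthly comment counts per subreddit and a lookup from subreddit to category. The function name, the category labels, and every number in the sample are made up for illustration and are not taken from our dataset.

```python
from collections import Counter


def category_shares(comment_counts, category_of, top_n=10_000):
    """Return each category's share of comments among the top_n subreddits.

    `comment_counts` maps subreddit -> comments in a month.
    `category_of` maps subreddit -> category label (hypothetical labels).
    Also returns the fraction of all comments the top_n subreddits cover.
    """
    total = sum(comment_counts.values())
    top = sorted(comment_counts.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    covered = sum(count for _, count in top)

    by_category = Counter()
    for subreddit, count in top:
        by_category[category_of.get(subreddit, "uncategorized")] += count

    shares = {cat: count / covered for cat, count in by_category.items()}
    return shares, covered / total


# Made-up monthly counts; top_n is 3 here only to keep the example tiny.
counts = {"r/gaming": 300, "r/boston": 280, "r/worldnews": 220, "r/tinyforum": 200}
categories = {"r/gaming": "gaming", "r/boston": "local", "r/worldnews": "news"}
shares, coverage = category_shares(counts, categories, top_n=3)
print("coverage of all comments:", round(coverage, 2))
for cat, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(cat, round(share, 3))
```

Note the two different denominators: category shares are computed against the comments in the selected subreddits, while coverage is computed against every comment on the platform.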
We have the right tools
Our Media Cloud tools are designed to study online conversations, particularly conversations in new media and in social media. We’ve had the ability for some time to study Reddit, using the API to the Pushshift database, but have previously only been able to search either all of Reddit or just one specific subreddit. Now that we’ve connected Media Cloud and RedditMap, we can search within the various categories our team has created and documented.
We think RedditMap is going to be deeply useful for social media researchers. But we also think it’s a lot of fun. Members of our lab reported losing hours, going down various internet “rabbit holes” to discover communities they hadn’t seen before. I found myself remembering an older version of the Internet—pre-Web—when “surfing the internet” meant visiting Usenet, a set of text-based discussion forums organized around interesting and esoteric topics. I remember, accurately or not, a feeling of serendipity and discovery being more common than I experience today on an internet where I am mostly led to see content my friends have shared.
There are some major limitations to our tool. First, it’s not done yet, and some significant features are missing. We plan on adding an in-map preview of the subreddit you’re going to visit. We also hope to add search soon, so you can see where we’ve categorized your favorite subreddit. And there are surely countless bugs we will be squashing.
Our current hierarchy of subreddits isn’t always going to make sense. This reflects the tension between how humans organize ideas and how our clustering algorithms see Redditors behaving. Exploring the tool today, I found r/worldnews, a subreddit I read regularly, clustered with subreddits about wargames. While we wouldn’t put books about world events near books about war games in library stacks, what our tool shows is that there is a huge overlap between wargamers and fans of world news.
Finally, our ability to keep the project going is in some jeopardy because of a current negotiation between Reddit and Pushshift over data access. We currently use Pushshift data to build our maps, which has given us the ability to launch a version of the map reflecting a few months of data (we hope to be able to bring earlier months online in the next several weeks). But we also planned on maintaining the maps, which means continuing to bring in new data. Our ability to do that may depend on whether Reddit and Pushshift can find an agreement that allows Pushshift to keep archiving Reddit data. Like all other researchers who depend on Pushshift, we’re trying to learn more about the future of that data set, and thinking about how we might maintain this tool without it in case Reddit and Pushshift can’t come to terms.
We hope you’ll use RedditMap to find corners of the internet you hadn’t explored before: unfamiliar, fascinating, strange, delightful, disturbing. If you do, put up a post tagged #redditmap so we can see what you’ve discovered.
And let us know what you’d like to see in the future of the tool, whether you’re a professional media researcher or someone using RedditMap for the joy of exploration. You can do that by joining our subreddit for discussing the tool, r/RedditMapDotSocial, or by submitting a feature request on GitHub.