A man sits on a train traveling from Denmark to Sweden. The duffle bag next to him contains several pounds of marijuana. The train slows sooner than expected, and he tenses: when it crosses the border, the Swedish police will check passengers for illegal substances, and he has a lot on him.
He rushes to the bathroom, climbs on the toilet, removes the ceiling tiles, and empties the contents of the duffle bag. After sliding the tiles back in place, he exits the bathroom and walks calmly back to his seat. When the police check his bag, they find only the roll of tape he used to hide the drugs. When they leave, he takes out his phone, opens the Reddit app, and posts his story to a marijuana forum.
John D. Martin, III, a researcher in the UNC School of Information and Library Science, says this type of story is actually common on drug forums. He and a team of researchers used a data collection technique called web scraping to gather data from two of Reddit’s sub-forums, or “subreddits.” Subreddits’ names come from the part of the URL after the domain, so the forum at reddit.com/r/trees is called “r/trees.” Martin’s team studied “r/trees” and “r/opiates.”
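The naming rule is simple enough to sketch in a few lines of Python. This is only an illustration of the rule described above, not code from Martin’s study; the function name subreddit_name is hypothetical, and the sketch assumes the forum name is the path segment that follows "r".

```python
from urllib.parse import urlparse

def subreddit_name(url):
    """Return a name like 'r/trees' from a Reddit URL, or None if there is none."""
    # urlparse only recognizes the domain when a scheme is present, so add one if missing.
    if "://" not in url:
        url = "https://" + url
    parts = [segment for segment in urlparse(url).path.split("/") if segment]
    # Assumption: the forum name is the path segment right after "r".
    if len(parts) >= 2 and parts[0] == "r":
        return "r/" + parts[1]
    return None

name = subreddit_name("reddit.com/r/trees")
```

Here name ends up as "r/trees", and a URL with no "/r/" segment returns None.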
“These are places where people feel comfortable talking to one another because they share some common experience,” Martin says. While the data he obtained revealed a number of possibilities for further research, it also uncovered several ethical questions about online privacy and data collection.
Web scraping scripts scan web pages for particular HTML tags and pull their content out into a data set. Neal Caren, a sociology professor who created a video series on web scraping, says this is not a highly technical process. “Anyone who knows a little about web coding could learn it in about three hours,” he says.
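A minimal sketch of that idea, using only Python’s standard library, might look like the following. It is not the code Martin’s team or Caren used. The class TagTextScraper, the function scrape_tag, and the sample page are all made up for illustration, and a real scraper would first download the page rather than read it from a string.

```python
from html.parser import HTMLParser

class TagTextScraper(HTMLParser):
    """Collect the text found inside every occurrence of one HTML tag."""

    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag
        self.depth = 0      # how many target tags we are currently inside
        self.rows = []      # one entry per outermost matched tag
        self._buffer = []   # text pieces seen inside the current match

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.rows.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        # Keep text only while we are inside a target tag.
        if self.depth > 0:
            self._buffer.append(data)

def scrape_tag(html, tag):
    parser = TagTextScraper(tag)
    parser.feed(html)
    parser.close()
    return parser.rows

# Hypothetical page standing in for a downloaded forum thread.
SAMPLE_PAGE = """
<html><body>
  <h1>Forum</h1>
  <p class="post">First post text.</p>
  <div>navigation</div>
  <p class="post">Second post, with <b>bold</b> text.</p>
</body></html>
"""

posts = scrape_tag(SAMPLE_PAGE, "p")
```

The result, posts, is a two-item list holding the text of each paragraph tag; the heading and the navigation block are skipped.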
Asking for a friend
Web scraping is great for analyzing trends through large amounts of text, according to Caren. Often, web scrapers will count the number of times a site used a certain word or phrase, but Martin’s team went further and attached qualities to the different ways users revealed they engaged in illegal activity. “We went into it with the intention of investigating not necessarily why people disclose potentially incriminating information online, but how they did it,” Martin says.
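The basic counting step Caren describes can be sketched as follows. The posts and phrases here are invented for illustration, count_phrases is a hypothetical helper, and the sketch covers only simple counting, not the qualitative coding Martin’s team did on top of it.

```python
import re
from collections import Counter

def count_phrases(posts, phrases):
    """Count case-insensitive, whole-word occurrences of each phrase across posts."""
    counts = Counter()
    for post in posts:
        text = post.lower()
        for phrase in phrases:
            # \b keeps "swim" from matching inside a longer word such as "swimming".
            pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
            counts[phrase] += len(re.findall(pattern, text))
    return counts

# Invented posts standing in for scraped forum text.
SAMPLE_POSTS = [
    "SWIM wants to know if this is safe.",
    "Asking for a friend: how long does it last?",
    "My friend said swim should wait. Asking for a friend again.",
]

counts = count_phrases(SAMPLE_POSTS, ["SWIM", "asking for a friend"])
```

A count like this cannot tell the acronym from the ordinary verb “swim,” which is one reason simple tallies are usually only a starting point.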
One trend Martin notes among users is what he calls “un-identifying themselves” when asking for advice. Users sometimes preface their requests with the acronym SWIM (“Someone Who Isn’t Me”). Other times, they say they are asking for a friend, even though this is often not the case. “In the opiates forum, it was so apparent that it was a joke,” Martin says. “That in-joke became an avenue through which to discuss these topics.”
Most users seem unconcerned with disclosing their illegal activities, Martin points out. In the example with the man on the train, a few forum posters asked why he would publish that story. His response: “What — do you think the cops are looking on Reddit?”
Reviewing the research
When he submitted the research plan to the only non-biomedical Institutional Review Board on campus (UNC has six IRBs, five of which deal with biomedical research), Martin expected them to supervise the study, since the research involved data from real people. But the board told him it did not fall into the category of human subject research and, therefore, was not subject to oversight.
Elizabeth Kipp-Campbell, director of the Office of Human Research Ethics, says the office’s policies come from federal regulations. A human subject is defined as “a living individual about whom an investigator conducting research obtains data through intervention or interaction with the individual, or identifiable private information.” A ruling that something is not human subject research is not an indictment of a study; it simply means the study is not subject to federal regulations. “Research with real people is a privilege, not a right,” Kipp-Campbell says. “Just because we don’t deem something ‘human subject research’ doesn’t mean the researcher doesn’t have ethical obligations.”
To protect the identities of the forum posters, the researchers did not include usernames in the report, and they changed the wording of some of the posts while keeping the meaning intact, so someone who comes across the study can’t simply copy and paste the text into Google and find the post. Martin understands the board’s decision, since all the information is public, but he is still concerned about what it means in terms of privacy.
“If our institutional checks are not in place and we can just do whatever we want with the data people are putting online, then I guess we can do almost anything,” Martin says. “That’s kind of a scary world.”
Intervening in the forums
Besides sharing stories about drug use, many forum users ask for advice on how to use drugs or how to combine them. Some of the discussion is about recovery from addiction. One user remarked he was in recovery, and that “r/opiates” was actually a trigger for him. “He missed that community,” Martin says. “But a bunch of people were like, ‘If this is a trigger for you, and you’re in recovery, you shouldn’t be here.’”
This willingness to share could provide some public health solutions to drug abuse. One way to use this data could be to plant people in these forums to intervene when a discussion approaches physical or legal harm to the poster. Another involves using web scraping as an automated tool for these interventions. “What if we created a bot that looks for indicators in a drug forum where someone is about to combine substances in a way that might kill them?” Martin says.
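In its simplest form, such a bot could be a keyword check run over each scraped post. The sketch below is only a toy version of the idea Martin floats, not a tool his team built. The watch list, the term sets, and the function flag_post are hypothetical; the one pairing shown (opioids with benzodiazepines) is a combination public health agencies warn about, but the word lists are illustrative and far from complete.

```python
import re

# Hypothetical watch list: (label, terms for one substance class, terms for another).
# The term sets are illustrative assumptions, not a vetted clinical vocabulary.
WATCH_LIST = [
    (
        "opioid + benzodiazepine",
        {"opioid", "opioids", "opiate", "opiates"},
        {"benzodiazepine", "benzodiazepines", "benzo", "benzos"},
    ),
]

def flag_post(text, watch_list=WATCH_LIST):
    """Return the labels of every watched combination mentioned in one post."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [
        label
        for label, group_a, group_b in watch_list
        # Flag only when the post mentions at least one term from each group.
        if words & group_a and words & group_b
    ]

flags = flag_post("Is it ok to take benzos with opiates tonight?")
```

A post that mentions only one of the two groups is not flagged. Matching on keywords alone would also flag posts that warn against the combination, so a working tool would need more context than this.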
The success of this strategy would depend largely on how credible the people intervening seem to the forum users, according to Christopher Ringwalt, a senior research scientist at the UNC Injury Prevention Research Center. “That is, if the participants in the drug forum were aware that university-based investigators were behind a particular comment,” he explains, “they might discount the message because they distrusted the investigators’ motives or sources of data.”
It would be difficult to follow up with the forum users to see if the strategy was actually working, due to their anonymity, but Ringwalt thinks it’s a good way to reach the target population. “It would have to be subjected to the same kind of testing as any other prevention method,” he says, “but I think novel ideas like this one have potential merit, so let’s try it out and see if it works.”
Law enforcement could use this tool to target people for investigation, Martin points out. Court sentencing recommendations have already moved to algorithms in some places, so to Martin, a world where algorithms look for illegal behavior online is not a far-fetched idea.
“I don’t just think that’s coming. I know it’s coming,” he says.
Caren agrees that law enforcement probably already monitors the public’s data, but he says it is not yet done in such a systematic way. “I highly doubt the Chapel Hill or Durham police departments are monitoring Reddit,” Caren says. “They’re most likely watching a local Facebook feed, not web scraping.”
Protecting privacy
Although many of his colleagues do not use social media because they want to protect their privacy, Martin remains unconcerned about data gathering. Instead, he recommends keeping track of where your data is going and advises against using your real name online, unless it’s for a site like Facebook. He also stresses awareness of location-based data, as many tweets and photos contain information about where they were posted in the metadata.
“This isn’t a time for everybody to freak out and close their Facebook accounts,” Martin says, “but it is a time to be aware of what’s happening when you do anything in these spaces.”