How Researchers Scrub Twitter for Health Data from Real Humans — Not Bots


Twitter is noisy, which makes it a perfect tool for bioinformatics experts like Graciela Gonzalez-Hernandez, PhD, who study language to help improve health outcomes. The platform holds a seemingly endless stream of posts that researchers can use to track people’s behavior or spot trends in medicine.

The problem is, it’s hard to get to.

“There is tremendous value in social media, but mining it has its challenges,” said Gonzalez-Hernandez, an associate professor of Informatics in the department of Biostatistics, Epidemiology, and Informatics in the Perelman School of Medicine at the University of Pennsylvania, and director of the Health Language Processing Center. “How do you find the right information?”

And by right, she also means real. For all its perks, Twitter is loaded with “bots” that every second of every day push out messages, both nefarious and legitimate, that researchers don’t want to analyze. Bots are software applications that run automated tasks, like generating messages using algorithms or aggregating related content from real accounts and sharing it under another one to perpetuate a particular idea. One study from the University of Southern California found that up to 15 percent of users on Twitter are bots.

“They get retweeted so many times that eventually people start seeing it so much that they start believing it,” Gonzalez-Hernandez said.

But regardless of what other Twitter users might perceive, those bots aren’t churning out human behaviors. As data for measuring human trends, their output is not clean. And to draw scientifically sound conclusions or make predictions, researchers need the cleanest data possible. That means authentic conversations from real people, which help researchers better understand diseases, birth defects, drug use, and infection rates, among many other health topics.

Why Mine Social Media for Health Data?

Today, researchers often turn to surveys to learn about health behaviors, but those have drawbacks. Survey responses are limited to the questions posed. If the researcher did not think of the right question to ask, Gonzalez-Hernandez said, then they may miss something.

“With social media on the other hand,” she added, “we are just listening to what people are saying, without structure. In essence, what we’ve been able to do is capture what’s important to them and not what’s important to me as a researcher.”

For more than 10 years, Gonzalez-Hernandez has been studying natural language across social media to inform clinical care in work that’s funded through the National Library of Medicine and the National Institute of Allergy and Infectious Diseases.

The approach has led her and her team to discover information about pregnant women, including comments from mothers of children with birth defects that could help researchers understand what causes those defects. One study this year, published in JAMA Network Open in June, also revealed personal experiences with and attitudes toward statin use, such as adverse effects and the belief that taking the drugs gave people license to eat poorly and exercise little. It’s valuable insight for patient education and for physicians looking to better communicate the risks and benefits of certain drugs.

For the statin study, the researchers had to manually annotate more than 12,000 tweets. That is a relatively low number next to the 500 million tweets pushed out every day.

It’s why working towards more effective social media tools is so important. A larger sample of clean tweets, or any kind of data pulled from social networks, will only help researchers hear real patients’ voices and strengthen the science.

How Do You Automatically Detect Health-Related Bots on Twitter?

When Gonzalez-Hernandez and her team head to Twitter, they cast a critical eye to ensure those 280 characters come from a person and not a bot. But doing it manually can be time-consuming and limits the number of tweets that can be analyzed.

So, Gonzalez-Hernandez turned to the bot detection software commonly used to sniff out political bots, the ones designed to sway elections or promote policies. Since the software is considered to be a successful tool, could it help spot health-related bots, too? Those range from anti-vaccine campaigns and COVID-19 conspiracies to more benign accounts, like advertisements that push certain drugs or handles that gather up recent scientific studies.

Not very well, she and her fellow researchers would come to find out.

The team included Anahita Davoudi, PhD, a postdoc at the Perelman School of Medicine; Ari Z. Klein, PhD, a research associate at Penn; and Abeed Sarker, PhD, of Emory University. When they ran the detection software, called Botometer, on more than 8,000 known bots and non-bots, it performed poorly. The tool scores accounts between 0 and 1, with 1 most likely to be a real person. The average score was 0.361 for non-bots.

They published their findings in May in the American Medical Informatics Association Joint Summits Translational Science Proceedings.

“It’s a different profile,” Gonzalez-Hernandez said. “If you think about it, the messages are different; the way it’s carried out is different. The frequency might be different. There are enough differences that you cannot simply grab that software and then run it on your health data.”

The next step was to build a better version of it. The team fine-tuned the software by using a machine learning algorithm and layering on other features that would home in on health bots. It worked, bringing that score up to 0.700 when distinguishing real humans from bot accounts.
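For readers curious what "layering on other features" can look like in practice, here is a minimal Python sketch. It is not the team's model: the article does not name the algorithm or the features, so the logistic regression, the three features (an existing detector's score, tweets per day, share of tweets with links), and every number below are assumptions made up for illustration.

```python
import math

# Each row: [existing detector score, tweets per day, share of tweets with links].
# All values and feature choices are hypothetical, for illustration only.
accounts = [
    ([0.40, 120.0, 0.95], 1),  # 1 = bot
    ([0.35, 200.0, 0.90], 1),
    ([0.45, 90.0, 0.85], 1),
    ([0.38, 150.0, 0.99], 1),
    ([0.36, 4.0, 0.10], 0),    # 0 = human
    ([0.42, 8.0, 0.20], 0),
    ([0.39, 2.0, 0.05], 0),
    ([0.44, 12.0, 0.30], 0),
]

def fit_scaler(rows):
    """Return per-feature (min, max) so features can be rescaled to [0, 1]."""
    return [(min(col), max(col)) for col in zip(*rows)]

def scale(row, bounds):
    """Min-max scale one feature row using the training bounds."""
    return [(x - lo) / (hi - lo) if hi > lo else 0.0
            for x, (lo, hi) in zip(row, bounds)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Probability that an account with these scaled features is a bot."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def train(samples, labels, epochs=2000, lr=0.5):
    """Plain batch gradient descent on the logistic loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = predict(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

rows = [features for features, _ in accounts]
labels = [label for _, label in accounts]
bounds = fit_scaler(rows)
weights, bias = train([scale(r, bounds) for r in rows], labels)

def bot_probability(raw_features):
    """Score a new account given its raw (unscaled) features."""
    return predict(weights, bias, scale(raw_features, bounds))

print(bot_probability([0.40, 180.0, 0.92]))  # high-volume, link-heavy account
print(bot_probability([0.40, 5.0, 0.15]))    # low-volume, mostly link-free account
```

Both example accounts get the same score from the existing detector, so any difference in the output comes from the added features. That is the general point of the approach: signals that matter for one kind of bot can be supplemented with signals that matter for another.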

“Introducing more features would likely contribute to further enhancing performance, which we will explore in future work,” the authors wrote in the study. It’s a worthy effort, Gonzalez-Hernandez said, if it means getting to the best data possible and ultimately improving the lives of patients.

“There are so many subject areas in health that could benefit from using social media, particularly now,” Gonzalez-Hernandez said. “Having the right methods and right approach will help in making it even more valuable.”


About this Blog

This blog is written and produced by Penn Medicine’s Department of Communications.

Views expressed are those of the author or other attributed individual and do not necessarily represent the official opinion of the related Department(s), University of Pennsylvania Health System (Penn Medicine), or the University of Pennsylvania, unless explicitly stated with the authority to do so.

Health information is provided for educational purposes and should not be used as a source of personal medical advice.
