
AI Chatbots Aren’t Experts on Psych Medication Reactions — Yet
Asking artificial intelligence (AI) for advice can be tempting. Powered by large language models (LLMs), AI chatbots are available 24/7, are often free to use, and draw on troves of data to answer questions. Now, people with mental health conditions are asking AI for advice when experiencing potential side effects of psychiatric medicines — a decidedly higher-risk situation than asking it to summarize a report.
One question puzzling the AI research community is how AI performs when asked about mental health emergencies. Globally, including in the U.S., there is a significant gap in mental health treatment, with many individuals having limited to no access to mental healthcare. It’s no surprise that people have started turning to AI chatbots with urgent health-related questions.
Researchers at the Georgia Institute of Technology have developed a new framework to evaluate how well AI chatbots can detect potential adverse drug reactions in chat conversations, and how closely their advice aligns with that of human experts. The study was led by Munmun De Choudhury, J.Z. Liang Associate Professor in the School of Interactive Computing, and Mohit Chandra, a third-year computer science Ph.D. student.
“People use AI chatbots for anything and everything,” said Chandra, the study’s first author. “When people have limited access to healthcare providers, they are increasingly likely to turn to AI agents to make sense of what’s happening to them and what they can do to address their problem. We were curious how these tools would fare, given that mental health scenarios can be very subjective and nuanced.”
De Choudhury, Chandra, and their colleagues will introduce their new framework at the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, April 29–May 4.
Putting AI to the Test
Going into their research, De Choudhury and Chandra wanted to answer two main questions: First, can AI chatbots accurately detect whether someone is having side effects or adverse reactions to medication? Second, if they can accurately detect these scenarios, can AI agents then recommend good strategies or action plans to mitigate or reduce harm?
The researchers collaborated with a team of psychiatrists and psychiatry students to establish clinically accurate answers from a human perspective and used those to analyze AI responses.
To build their dataset, they went to the internet’s public square, Reddit, where many have gone for years to ask questions about medication and side effects.
They evaluated nine LLMs, including general-purpose models (such as GPT-4o and Llama-3.1) and specialized models trained on medical data. Using the evaluation criteria provided by the psychiatrists, they computed how precise the LLMs were in detecting adverse reactions and correctly categorizing the types of adverse reactions caused by psychiatric medications.
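As a rough illustration of what precision means in a detection task like this one, the sketch below compares a model's yes/no judgments against clinician labels. The function name and the five example labels are hypothetical and are not drawn from the study; the researchers' actual evaluation may differ.

```python
def precision(predicted, actual):
    """Share of posts the model flagged that clinicians also flagged."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    flagged = sum(1 for p in predicted if p)
    # Return 0.0 when the model flagged nothing, to avoid dividing by zero.
    return true_pos / flagged if flagged else 0.0

# Hypothetical labels for five posts (True = adverse drug reaction present).
clinician_labels = [True, False, True, True, False]
model_labels = [True, True, True, False, False]

score = precision(model_labels, clinician_labels)
print(f"precision = {score:.2f}")
```

In this made-up case, the model flags three posts and clinicians agree on two of them. Precision alone says nothing about the reactions the model missed, which is why it is usually read alongside other measures.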
Additionally, they prompted the LLMs to generate answers to queries posted on Reddit and compared how well those answers aligned with the clinicians' answers across four criteria:
- Emotion and tone expressed
- Answer readability
- Proposed harm-reduction strategies
- Actionability of the proposed strategies
The research team found that LLMs stumble when comprehending the nuances of an adverse drug reaction and distinguishing different types of side effects. They also discovered that while LLMs sounded like human psychiatrists in their tones and emotions — such as being helpful and polite — they had difficulty providing true, actionable advice aligned with the experts.
Better Bots, Better Outcomes
The team’s findings could help AI developers build safer, more effective chatbots. Chandra’s ultimate goals are to inform policymakers of the importance of accurate chatbots and help researchers and developers improve LLMs by making their advice more actionable and personalized.
Chandra notes that improving AI for psychiatric and mental health concerns would be particularly life-changing for communities that lack access to mental healthcare.
“When you look at populations with little or no access to mental healthcare, these models are incredible tools for people to use in their daily lives,” Chandra said. “They are always available, they can explain complex things in your native language, and they become a great option to go to for your queries.
“When the AI gives you incorrect information by mistake, it could have serious implications on real life,” Chandra added. “Studies like this are important, because they help reveal the shortcomings of LLMs and identify where we can improve.”
Funding: National Science Foundation (NSF), American Foundation for Suicide Prevention (AFSP), Microsoft Accelerate Foundation Models Research grant program. The findings, interpretations, and conclusions of this paper are those of the authors and do not represent the official views of NSF, AFSP, or Microsoft.
Photo by Kevin Beasley/College of Computing