SPOKANE, Wash. — When it comes to treating cardiovascular issues, doctors are still a better bet than artificial intelligence. In a new study conducted at Washington State University, researchers evaluated ChatGPT-4’s ability to assess the risk of a heart attack among simulated patients with chest pain. The generative AI system provided inconsistent conclusions and failed to match the methods doctors use to assess a patient’s cardiac risk. Simply put, AI may be able to pass a medical exam, but it can’t replace your cardiologist yet.
Chest pain is one of the most common reasons people end up in the emergency room. Doctors often rely on risk assessment tools like the TIMI and HEART scores to help determine which patients are at high risk of a heart attack and need immediate treatment and which can safely be sent home. These tools take into account factors like the patient’s age, medical history, EKG findings, and blood test results.
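To see why these tools give reproducible answers, here is a minimal sketch of how the HEART score is tallied: each of five elements (history, ECG, age, risk factors, troponin) earns 0 to 2 points, and the total maps to a risk tier. This is an illustration based on the published HEART criteria, not code from the study, and the variable names are our own.

```python
# Illustrative sketch of the HEART score: five elements (History, ECG, Age,
# Risk factors, Troponin) each earn 0-2 points, and the total maps to a risk
# tier. Thresholds follow the published HEART criteria; not code from the study.

def heart_score(history_pts: int, ecg_pts: int, age: int,
                num_risk_factors: int, troponin_ratio: float) -> tuple[int, str]:
    """Return (total score, risk tier) for a chest-pain patient."""
    # Age: <45 -> 0, 45-64 -> 1, >=65 -> 2
    age_pts = 0 if age < 45 else (1 if age < 65 else 2)
    # Risk factors (hypertension, diabetes, smoking, etc.): none -> 0, 1-2 -> 1, >=3 -> 2
    rf_pts = 0 if num_risk_factors == 0 else (1 if num_risk_factors <= 2 else 2)
    # Troponin relative to the normal limit: <=1x -> 0, 1-3x -> 1, >3x -> 2
    trop_pts = 0 if troponin_ratio <= 1 else (1 if troponin_ratio <= 3 else 2)

    total = history_pts + ecg_pts + age_pts + rf_pts + trop_pts
    tier = "low" if total <= 3 else ("moderate" if total <= 6 else "high")
    return total, tier

# Example: a 58-year-old with a moderately suspicious history, a normal ECG,
# two risk factors, and normal troponin totals 3 points -> low risk.
print(heart_score(history_pts=1, ecg_pts=0, age=58,
                  num_risk_factors=2, troponin_ratio=0.8))
```

Because the rules are fixed, the same inputs always produce the same score, which is exactly the consistency ChatGPT-4 struggled to match.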
In this study, published in the journal PLoS ONE, researchers created three sets of simulated patient data: one based on the variables used in the TIMI score, one based on the HEART score, and a third that included a whopping 44 different variables that might be relevant for a patient with chest pain. They then fed this data to ChatGPT-4 and asked it to calculate a risk score for each “patient.”
The good news? Overall, ChatGPT-4’s risk assessments correlated very well with the tried-and-true TIMI and HEART scores. This suggests that, with the right training, AI language models like ChatGPT have the potential to be valuable tools in helping doctors quickly and accurately assess a patient’s risk.
However, there was a worrying trend beneath the surface. When researchers fed ChatGPT-4 the exact same patient data multiple times, it often spit out very different risk scores. In fact, for patients with a fixed TIMI or HEART score, ChatGPT-4 gave a different score nearly half the time. This inconsistency was even more pronounced in the more complex 44-variable model, where ChatGPT-4 came to a consensus on the most likely diagnosis only 56 percent of the time.
“ChatGPT was not acting in a consistent manner,” says lead study author Dr. Thomas Heston, a researcher with Washington State University’s Elson S. Floyd College of Medicine, in a media release. “Given the exact same data, ChatGPT would give a score of low risk, then next time an intermediate risk, and occasionally, it would go as far as giving a high risk.”
Part of the issue may lie in how language models like ChatGPT-4 are designed. To mimic the variability and creativity of human language, they incorporate an element of randomness. While this makes for more natural-sounding responses, it can clearly be a problem when consistency is key, as it is in medical diagnoses and risk assessments.
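For readers who want to see this randomness for themselves, the sketch below sends an identical chest-pain vignette to the model several times and compares the answers, then repeats the experiment with the sampling temperature (the "dial" that controls this randomness) set to zero. It uses the OpenAI Python SDK and requires an API key; the prompt, model name, and trial count are illustrative assumptions, not the study's actual protocol.

```python
# Minimal sketch of the reproducibility issue: ask the identical question several
# times and compare the answers. With the default sampling temperature the model
# may return different risk categories; setting temperature=0 makes its output
# far more deterministic. Assumes the OpenAI Python SDK and an API key; the
# prompt and model name are placeholders, not the study's protocol.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("A 58-year-old with two cardiac risk factors, a normal ECG, and "
          "normal troponin presents with chest pain. Classify the heart-attack "
          "risk as low, intermediate, or high. Answer with one word.")

def ask(temperature: float, trials: int = 5) -> list[str]:
    """Ask the same question `trials` times at a given temperature."""
    answers = []
    for _ in range(trials):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": PROMPT}],
            temperature=temperature,
        )
        answers.append(resp.choices[0].message.content.strip().lower())
    return answers

print("default sampling:", ask(temperature=1.0))  # answers may disagree
print("temperature = 0: ", ask(temperature=0.0))  # answers should largely agree
```

Turning the temperature down is essentially the kind of tweak the researchers suggest later in the article, though it trades away some of the conversational flexibility that makes these models feel natural.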
Researchers did find that ChatGPT-4 performed better for patients at the low and high ends of the risk spectrum. It was in the medium-risk patients where the AI’s assessments were all over the map. This is particularly concerning, as these are the patients for whom accurate risk stratification is most important in guiding clinical decision-making.
Another red flag was ChatGPT-4’s occasional tendency to recommend inappropriate tests. For example, it sometimes suggested an endoscopy (a procedure to examine the digestive tract) as the first test for a patient it thought might have acid reflux, rather than starting with less invasive tests as a doctor would.
“We found there was a lot of variation, and that variation in approach can be dangerous,” explains Dr. Heston. “It can be a useful tool, but I think the technology is going a lot faster than our understanding of it, so it’s critically important that we do a lot of research, especially in these high-stakes clinical situations.”
Researchers suggest a few potential avenues for improving ChatGPT-4. One is to tweak the language model to reduce the level of randomness in its responses when analyzing medical data. Another is to train specialized versions of ChatGPT-4 exclusively on carefully curated medical datasets rather than the broad, unfiltered data it currently learns from.
Despite the current limitations, researchers remain optimistic about the future of AI in medicine. They propose that tools like ChatGPT-4, with further refinement and in combination with established clinical guidelines, could one day help doctors make faster and more accurate assessments, ultimately leading to better patient care.
“ChatGPT could be excellent at creating a differential diagnosis and that’s probably one of its greatest strengths,” notes Dr. Heston. “If you don’t quite know what’s going on with a patient, you could ask it to give the top five diagnoses and the reasoning behind each one. So it could be good at helping you think through a problem, but it’s not good at giving the answer.”
One thing is clear: we’re not there yet. As impressive as ChatGPT-4 is, this study shows that it’s not ready to be let loose on real patients. Rigorous testing and refining of these AI models is crucial before they can be trusted with the high stakes of medical decision-making. The health and safety of patients must always come first.
StudyFinds’ Matt Higgins contributed to this report.
Source: ChatGPT fails to properly diagnose heart attack risk for patients with chest pain