Canada: Large language models (LLMs) could change how pancreatic cancer is staged and how its treatment is planned. Recent research shows that they can generate automated synoptic reports and accurately categorize resectability status from the text of original radiology reports.
In their study published in Radiology, the researchers revealed that GPT-4 outperforms GPT-3.5 at creating structured, summarized radiology reports for pancreatic ductal adenocarcinoma (PDAC). They found that GPT-4 created near-perfect PDAC synoptic reports from original reports, GPT-4 with chain-of-thought prompting achieved high accuracy in categorizing resectability, and surgeons were more efficient and accurate when they used AI-generated reports.
“The study results are good news for clinicians and patients, as the AI tool could improve surgical decision-making,” Rajesh Bhayana, University of Toronto, ON, Canada, and colleagues wrote.
Pancreatic cancer presents a formidable challenge due to its aggressive nature and often late-stage diagnosis. Accurate assessment of tumor resectability—whether a tumor can be surgically removed—is crucial for determining treatment strategies and patient outcomes. Traditionally, this assessment involves meticulous analysis of radiological scans by trained specialists.
Structured radiology reports for pancreatic ductal adenocarcinoma improve surgical decision-making over free-text reports, but radiologist adoption is variable. Resectability criteria are applied inconsistently. Considering this, the research team aimed to evaluate the performance of LLMs in automatically creating PDAC synoptic reports from original reports and to explore performance in categorizing tumor resectability.
For this purpose, the researchers conducted an institutional review board–approved retrospective study comprising 180 consecutive PDAC staging CT reports on patients referred to the authors’ European Society for Medical Oncology–designated cancer center from January to December 2018. Two radiologists reviewed the reports to establish the reference standard for 14 key findings and the National Comprehensive Cancer Network (NCCN) resectability category.
GPT-3.5 and GPT-4, accessed between September 18 and 29, 2023, were tasked with generating synoptic reports from the original reports using the same 14 features. Their extraction performance was assessed against the reference standard in terms of recall, precision, and F1 score. Three prompting strategies (default knowledge, in-context knowledge, chain-of-thought) were used with both LLMs to categorize resectability.
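For readers unfamiliar with the metrics named above, the following is a minimal sketch of how precision, recall, and F1 score are computed from counts of correct, spurious, and missed extractions. The `Counts` class and the example numbers are illustrative assumptions and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical tally of extracted findings versus a reference standard."""
    tp: int  # findings extracted that match the reference standard
    fp: int  # findings extracted that are not in the reference standard
    fn: int  # reference-standard findings the model missed


def precision(c: Counts) -> float:
    # Share of extracted findings that were correct.
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Counts) -> float:
    # Share of reference-standard findings that were recovered.
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f1(c: Counts) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts for illustration only (not study data).
example = Counts(tp=90, fp=5, fn=10)
print(f"precision={precision(example):.3f} "
      f"recall={recall(example):.3f} f1={f1(example):.3f}")
```

Because F1 is a harmonic mean, it stays high only when both precision and recall are high, which is why it is a common single-number summary for extraction tasks.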
Hepatopancreaticobiliary surgeons assessed original and artificial intelligence (AI)-generated reports to evaluate resectability, comparing accuracy and review times.
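The three prompting strategies used for resectability categorization (default knowledge, in-context knowledge, chain-of-thought) can be pictured roughly as follows. This is a hypothetical sketch only: the function `build_prompt`, the prompt wording, and the choice to include the criteria text in the chain-of-thought prompt are all assumptions made for illustration, not the prompts the researchers used.

```python
def build_prompt(report_text: str, strategy: str, criteria_text: str = "") -> str:
    """Assemble an illustrative prompt for one of three strategies.

    All wording here is hypothetical; the study's actual prompts are not
    reproduced in this article.
    """
    task = ("Categorize the resectability of the tumor described in the "
            "following CT report.")
    if strategy == "default":
        # Default knowledge: rely on whatever the model already knows.
        parts = [task]
    elif strategy == "in_context":
        # In-context knowledge: supply the criteria text in the prompt.
        parts = [task, "Apply these criteria:\n" + criteria_text]
    elif strategy == "chain_of_thought":
        # Chain-of-thought: ask for explicit intermediate reasoning.
        # Including the criteria here is an assumption of this sketch.
        parts = [
            task,
            "Apply these criteria:\n" + criteria_text,
            "Reason step by step before giving the final category.",
        ]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    parts.append("Report:\n" + report_text)
    return "\n\n".join(parts)


# Placeholder inputs; real criteria and report text would be substituted.
print(build_prompt("<report text>", "chain_of_thought", "<criteria text>"))
```

The point of the sketch is only the structural difference between the strategies: each one adds more guidance to the same underlying task.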
The researchers reported the following findings:
- GPT-4 outperformed GPT-3.5 in creating synoptic reports (F1 score: 0.997 vs 0.967, respectively).
- Compared with GPT-3.5, GPT-4 achieved equal or higher F1 scores for all 14 extracted features. GPT-4 had higher precision than GPT-3.5 for extracting superior mesenteric artery involvement (100% vs 88.8%, respectively).
- For categorizing resectability, GPT-4 outperformed GPT-3.5 for each prompting strategy.
- For GPT-4, chain-of-thought prompting was most accurate, outperforming in-context knowledge prompting (92% vs 83%, respectively), which in turn outperformed the default knowledge strategy (83% vs 67%, respectively).
- Surgeons were more accurate in categorizing resectability using AI-generated reports than original reports (83% vs 76%, respectively), while spending less time on each report (58%).
The findings showed that GPT-4 created near-perfect PDAC synoptic reports from original reports. GPT-4 with chain-of-thought achieved high accuracy in resectability categorization. Surgeons were more efficient and accurate using AI-generated reports.
Reference:
https://doi.org/10.1148/radiol.233117
Source: Chat GPT may produce structured, summarized radiology reports for pancreatic ductal