Detail Page - Xuanwu Hospital, Capital Medical University Knowledge Base

Intelligent Head and Neck CTA Report Quality Detection with Large Language Models

Publication details

Resource type:

WOS classification:

Indexed in: SCIE

Authors:

Affiliations: [1]Capital Med Univ, Xuanwu Hosp, Informat Ctr, Beijing 100053, Peoples R China [2]Capital Med Univ, Xuanwu Hosp, Dept Radiol & Nucl Med, Beijing 100053, Peoples R China [3]Beijing Key Lab Magnet Resonance Imaging & Brain I, Beijing 100053, Peoples R China

Source:

DOI:

ISSN:

Keywords: Artificial intelligence; Large language models; Radiological report; Reports quality

Abstract:

This study aims to identify common errors in head and neck CTA reports using GPT-4, ERNIE Bot, and SparkDesk, evaluating their potential for supporting quality control in Chinese radiological reports. This study collected 10,000 head and neck CTA imaging reports from Xuanwu Hospital (Dataset 1) and 5000 multi-center reports (Dataset 2). We identified six common types of errors and detected them using three large language models: GPT-4, ERNIE Bot, and SparkDesk. The overall quality of the reports was assessed using a 5-point Likert scale. We conducted a Wilcoxon rank-sum test and Friedman test to compare error detection rates and evaluate the models' performance on different error types and overall scores. For Dataset 2, after manual review, we annotated the six error types and provided overall scoring, while also recording the time taken for manual scoring and model detection. Model performance was evaluated using accuracy, precision, recall, and F1 score. The intraclass correlation coefficient measured consistency between manual and model scores, and ANOVA compared evaluation times. In Dataset 1, the error detection rates for final reports were significantly lower than those for preliminary reports across all three model types. Friedman's test indicated significant differences in error rates among the three models. In Dataset 2, the detection accuracy of the three LLMs for the six error types was above 95%. GPT-4 had a moderate consistency with manual scores (ICC = 0.517), while ERNIE Bot and SparkDesk showed slightly lower consistency (ICC = 0.431 and 0.456, respectively; P < 0.001). The models evaluated one hundred radiology reports significantly faster than human reviewers. LLMs can differentiate the quality of radiology reports and identify error types, significantly enhancing the efficiency of quality control reviews and providing substantial research and practical value in this field.
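The abstract names accuracy, precision, recall, and F1 score as the evaluation metrics for Dataset 2. As an illustration only, and not the authors' code, the following minimal Python sketch shows how these four metrics could be computed for a single error type. It assumes each report carries a binary label per error type (1 = error present, 0 = absent), with the manual annotation as the reference. The function name `binary_metrics` and the sample labels are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels.

    y_true: reference labels from manual review (assumed 0/1).
    y_pred: labels produced by a model (assumed 0/1).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for eight reports and one error type.
manual = [1, 1, 0, 0, 1, 0, 0, 0]
model = [1, 0, 0, 0, 1, 1, 0, 0]
result = binary_metrics(manual, model)
print(result)
```

In a setting like the one described, a function of this kind would be applied once per error type and per model. When errors are rare, accuracy alone can look high even if many errors are missed, which is one reason precision and recall are usually reported alongside it.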

Funding:

Language:

WOS:

First author:

First author affiliations: [2]Capital Med Univ, Xuanwu Hosp, Dept Radiol & Nucl Med, Beijing 100053, Peoples R China [3]Beijing Key Lab Magnet Resonance Imaging & Brain I, Beijing 100053, Peoples R China

Corresponding author:

Corresponding author affiliations: [2]Capital Med Univ, Xuanwu Hosp, Dept Radiol & Nucl Med, Beijing 100053, Peoples R China [3]Beijing Key Lab Magnet Resonance Imaging & Brain I, Beijing 100053, Peoples R China

Recommended citation (GB/T 7714):

APA：

MLA：

Related literature

[1] Large Language Models in Medicine: Applications, Challenges, and Future Directions
[2] Benchmarking Large Language Models for Cervical Spondylosis
[3] Application of Large Language Models in Emergency Neurology
[4] Artificial intelligence in pediatrics
[5] Multiple large language models versus experienced physicians in diagnosing challenging cases with gastrointestinal symptoms
[6] Large Language Models' Responses to Spinal Cord Injury: A Comparative Study of Performance
[7] Prospects for the application of artificial intelligence in geriatrics
[8] The actual performance of large language models in providing liver cirrhosis-related information: A comparative study
[9] Artificial intelligence in healthcare: Past, present and future
[10] AIROGS: Artificial Intelligence for Robust Glaucoma Screening Challenge