Affiliations: [1] Peking Univ, Sch Comp Sci, Beijing, Peoples R China; [2] Peking Univ, Sch Software Microelectron, Beijing, Peoples R China; [3] Peking Univ, Natl Engn Res Ctr Software Engn, Beijing, Peoples R China; [4] Capital Med Univ, Xuanwu Hosp, Beijing, Peoples R China; [5] Peking Univ, Sixth Hosp, Beijing, Peoples R China; [6] Peking Univ, Peoples Hosp, Beijing, Peoples R China; [7] Peking Univ, First Hosp, Beijing, Peoples R China
Source:
Abstract:
Medical visual question answering (MVQA) requires in-depth understanding of medical images and questions to provide reliable answers. We summarize multi-level progressive capabilities that models need to focus on in MVQA: recognition, details, diagnosis, knowledge, and reasoning. Existing MVQA models tend to overlook these capabilities due to unspecific data and plain architectures. To address these issues, this paper proposes the Multi-level Visual Language Model (MLeVLM) for MVQA. On the data side, we construct a high-quality multi-level instruction dataset, MLe-VQA, via GPT-4, which covers multi-level questions and answers as well as reasoning processes from visual clues to semantic cognition. On the architecture side, we propose a multi-level feature alignment module, including an attention-based token selector and a context merger, which can efficiently align features at different levels from visual to semantic. To better evaluate the model's capabilities, we manually construct a multi-level MVQA evaluation benchmark named MLe-Bench. Extensive experiments demonstrate the effectiveness of our constructed multi-level instruction dataset and the multi-level feature alignment module. The results also show that MLeVLM outperforms existing medical multimodal large language models.
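The abstract does not specify how the attention-based token selector and context merger operate; the following is a purely illustrative sketch (all function names, shapes, and the top-k selection rule are assumptions, not the paper's actual method) of one common way such a module can be realized: score visual tokens against a question embedding, keep the highest-scoring tokens, and compress the remainder into a single weighted context token.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def select_and_merge(tokens, query, k=4):
    """Hypothetical token selector + context merger.

    Scores each visual token against a query vector, keeps the top-k
    tokens at full resolution (selector), and compresses the remaining
    tokens into one attention-weighted summary vector (merger)."""
    scores = softmax(tokens @ query)           # (n,) attention weights
    order = np.argsort(scores)[::-1]           # tokens sorted by score, descending
    top, rest = order[:k], order[k:]
    selected = tokens[top]                     # (k, d) kept tokens
    w = scores[rest] / scores[rest].sum()      # renormalized weights of dropped tokens
    merged = (w[:, None] * tokens[rest]).sum(axis=0)  # (d,) single context token
    return np.vstack([selected, merged])       # (k+1, d) compact sequence

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))   # 16 visual tokens, dim 8
query = rng.normal(size=8)          # e.g. a pooled question embedding
out = select_and_merge(tokens, query, k=4)
print(out.shape)  # (5, 8): 4 selected tokens plus 1 merged context token
```

Under this sketch, coarse recognition-level questions can rely on the merged summary, while detail-level questions retain the individually selected tokens; the actual MLeVLM design may differ.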
Funding:
National Key R&D Program [2021YFF1201100]; Beijing Nova Program; Scientific and Technological Innovation Project of China Academy of Chinese Medical Sciences [CI2023C062YLL]
Language:
Foreign language (English)
WOS:
First author:
First author's affiliation: [1] Peking Univ, Sch Comp Sci, Beijing, Peoples R China
Co-first authors:
Corresponding author:
Recommended citation (GB/T 7714):
Xu Dexuan, Chen Yanyuan, Wang Jieyi, et al. MLeVLM: Improve Multi-level Progressive Capabilities based on Multimodal Large Language Model for Medical Visual Question Answering[C]//Findings of the Association for Computational Linguistics: ACL 2024. 2024: 4977-4997.
APA:
Xu, Dexuan, Chen, Yanyuan, Wang, Jieyi, Huang, Yue, Wang, Hanpin, ... & Huang, Yu. (2024). MLeVLM: Improve Multi-level Progressive Capabilities based on Multimodal Large Language Model for Medical Visual Question Answering. Findings of the Association for Computational Linguistics: ACL 2024, 4977-4997.
MLA:
Xu, Dexuan, et al. "MLeVLM: Improve Multi-level Progressive Capabilities based on Multimodal Large Language Model for Medical Visual Question Answering." Findings of the Association for Computational Linguistics: ACL 2024 (2024): 4977-4997.