AOR: Anatomical Ontology-Guided Reasoning for Medical Large Multimodal Model in Chest X-Ray Interpretation

Qingqiu Li1, Zihang Cui2, Seongsu Bae3, Jilan Xu1, Runtian Yuan1,
Yuejie Zhang1, Rui Feng3, Quanli Shen4, Xiaobo Zhang4, Junjun He5 and Shujun Wang6

1Fudan University    2Xidian University    3KAIST   4Children's Hospital of Fudan University   
5Shanghai AI Laboratory    6Hong Kong Polytechnic University   

πŸ“ TL;DR: By empowering Medical LMMs with anatomy-centric reasoning capabilities, we offer a new paradigm for interactive and explainable LMMs in medical imaging analysis.



Abstract

Chest X-rays (CXRs) are the most frequently performed imaging examinations in clinical settings. Recent advancements in Large Multimodal Models (LMMs) have enabled automated CXR interpretation, enhancing diagnostic accuracy and efficiency. However, despite their strong visual understanding, current Medical LMMs (MLMMs) still face two major challenges: (1) insufficient region-level understanding and interaction, and (2) limited accuracy and interpretability due to single-step reasoning. In this paper, we empower MLMMs with anatomy-centric reasoning capabilities to enhance their interactivity and explainability. Specifically, we first propose an Anatomical Ontology-Guided Reasoning (AOR) framework, which centers on cross-modal region-level information to facilitate multi-step reasoning. Next, under the guidance of expert physicians, we develop AOR-Instruction, a large instruction dataset for MLMM training. Our experiments demonstrate AOR's superior performance in both VQA and report generation tasks.


πŸ’‘AOR framework

(a) Our AOR framework, which flexibly accommodates both textual and optional visual prompts as input and centers on region-level information to enable multimodal, multi-step reasoning; (b) the three-stage training procedure for AOR.
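To make the region-centric, multi-step flow concrete, here is a minimal Python sketch of the pipeline described above. All names (`Region`, `ReasoningStep`, `AORPipeline`) and the three-call decomposition are our illustrative assumptions, not the released implementation.

```python
from dataclasses import dataclass

# Minimal sketch of the anatomy-centric, multi-step reasoning flow.
# All names and the three-call decomposition are illustrative
# assumptions, not the released code.

@dataclass
class Region:
    name: str                                 # e.g. "right lower lung zone"
    bbox: tuple[float, float, float, float]   # normalized (x1, y1, x2, y2)

@dataclass
class ReasoningStep:
    region: Region
    observation: str                          # region-level finding for this step

class AORPipeline:
    def localize(self, image, visual_prompt=None) -> list[Region]:
        """Ground anatomical regions; an optional visual prompt
        (e.g. a user-drawn box) restricts the regions considered."""
        ...

    def describe(self, image, region: Region) -> str:
        """Produce a region-level observation for one grounded region."""
        ...

    def conclude(self, question: str, steps: list[ReasoningStep]) -> str:
        """Reason over the chain of region-level observations
        to reach the final, explainable answer."""
        ...

    def run(self, image, question: str, visual_prompt=None) -> str:
        # Multi-step reasoning: localize -> describe each region -> conclude.
        regions = self.localize(image, visual_prompt)
        steps = [ReasoningStep(r, self.describe(image, r)) for r in regions]
        return self.conclude(question, steps)
```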

πŸ“‘ AOR-Instruction dataset

(a) The construction of AOR-VQA: anatomical ontology design → CoT construction → sample expansion; (b) the construction of AOR-RG: strict alignment between each anatomical region and its corresponding report sentence.
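As a rough illustration of what entries in the two subsets might look like, the sketch below shows one hypothetical AOR-VQA sample (CoT steps grounded in ontology regions) and one AOR-RG sample (region-to-sentence alignment). All field names and values are illustrative assumptions, not the released schema.

```python
# Hypothetical sample layouts for the two AOR-Instruction subsets.
# Field names and values are illustrative, not the released schema.

aor_vqa_sample = {
    "image": "cxr_00123.png",
    "question": "Is there evidence of basal atelectasis?",
    # CoT built over the anatomical ontology: each step ties an
    # observation to a grounded region.
    "cot": [
        {"region": "right lower lung zone",
         "bbox": [0.52, 0.55, 0.93, 0.90],
         "observation": "increased opacity at the right base"},
        {"region": "right costophrenic angle",
         "bbox": [0.78, 0.80, 0.95, 0.95],
         "observation": "angle remains sharp; no effusion"},
    ],
    "answer": "Yes; the findings are consistent with basal atelectasis.",
}

aor_rg_sample = {
    "image": "cxr_00456.png",
    # AOR-RG: each anatomical region is strictly aligned with the
    # report sentence describing it.
    "alignments": [
        {"region": "cardiac silhouette",
         "bbox": [0.35, 0.40, 0.70, 0.75],
         "sentence": "The cardiac silhouette is within normal limits."},
        {"region": "left lower lung zone",
         "bbox": [0.08, 0.55, 0.48, 0.90],
         "sentence": "There is atelectasis at the left base."},
    ],
}
```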

πŸ–ΌοΈ Results

For the VQA task, AOR generates correct and logically reasoned answers. For the report generation task, thanks to the incorporation of fine-grained anatomical regions, AOR shows a stronger grasp of details such as the ET tube, NG tube, and basal atelectasis. Moreover, it can generate the corresponding report sentences for user-specified regions.