Research Paper
Researchia:202606.19076

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

Yusuf Salcan

Submitted: June 19, 2026
Subjects: Machine Learning; Data Science

Description / Details

We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves results competitive with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, which indicates that our dataset transfers to external benchmarks. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.
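The abstract mentions grounding targets produced by automated segmentation and outputs given as bounding boxes or segmentations. As a minimal illustration of how those two representations relate, the sketch below derives a tight bounding box from a binary mask. It is not taken from the paper: the function names `mask_to_box` and `normalize_box`, the inclusive pixel convention, and the normalized coordinate format are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]           # 2D binary mask: rows of 0/1 values
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixel indices


def mask_to_box(mask: Mask) -> Optional[Box]:
    """Return the tightest axis-aligned box around nonzero pixels, or None if the mask is empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])


def normalize_box(box: Box, width: int, height: int) -> Tuple[float, float, float, float]:
    """Scale pixel coordinates to [0, 1] so the box does not depend on image resolution.

    The +1 on the max corner converts inclusive pixel indices to an exclusive edge.
    """
    x0, y0, x1, y1 = box
    return (x0 / width, y0 / height, (x1 + 1) / width, (y1 + 1) / height)


# Hypothetical 4x5 mask standing in for one segmented structure on a single slice.
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = mask_to_box(toy_mask)
```

Normalized coordinates are one common choice for box targets in vision-language training because they remain valid when images are resized; whether RadGrounder uses this convention is not stated in the abstract.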


Source: arXiv:2606.20477v1 (http://arxiv.org/abs/2606.20477v1)
PDF: https://arxiv.org/pdf/2606.20477v1

