
$G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA

Yaxin Du

Abstract

Retrieval-augmented generation is a practical paradigm for question answering over long documents, but it remains brittle for multimodal reading where text, tables, and figures are interleaved across many pages. First, flat chunking breaks document-native structure and cross-modal alignment, yielding semantic fragments that are hard to interpret in isolation. Second, even iterative retrieval can fail in long contexts by looping on partial evidence or drifting into irrelevant sections as noise accumulates, since each step is guided only by the current snippet without a persistent global search state. We introduce $G^2$-Reader, a dual-graph system, to address both issues. It evolves a Content Graph to preserve document-native structure and cross-modal semantics, and maintains a Planning Graph, an agentic directed acyclic graph of sub-questions, to track intermediate findings and guide stepwise navigation for evidence completion. On VisDoMBench across five multimodal domains, $G^2$-Reader with Qwen3-VL-32B-Instruct reaches 66.21% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08%).
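The abstract describes the Planning Graph as an agentic DAG of sub-questions that tracks intermediate findings and guides stepwise retrieval. A minimal sketch of that idea, assuming a simple node model in which each sub-question stores its prerequisites and any evidence found so far (all names here are illustrative, not taken from the paper's implementation), might look like:

```python
# Hypothetical sketch of a "Planning Graph": a DAG of sub-questions where
# each node records whether evidence has been found, and the next retrieval
# step is the first unanswered sub-question whose prerequisites are all
# resolved. This is an illustration of the concept, not the paper's code.
from collections import deque


class PlanningGraph:
    def __init__(self):
        self.deps = {}      # sub-question -> list of prerequisite sub-questions
        self.answered = {}  # sub-question -> evidence snippet, or None

    def add(self, question, deps=()):
        self.deps[question] = list(deps)
        self.answered[question] = None

    def record(self, question, evidence):
        """Store an intermediate finding for a sub-question."""
        self.answered[question] = evidence

    def next_step(self):
        """Return the first sub-question (in topological order) that is
        unanswered but whose prerequisites are all answered."""
        for q in self._topo_order():
            if self.answered[q] is None and all(
                self.answered[d] is not None for d in self.deps[q]
            ):
                return q
        return None  # plan complete

    def _topo_order(self):
        indeg = {q: len(ds) for q, ds in self.deps.items()}
        children = {q: [] for q in self.deps}
        for q, ds in self.deps.items():
            for d in ds:
                children[d].append(q)
        ready = deque(q for q, n in indeg.items() if n == 0)
        order = []
        while ready:
            q = ready.popleft()
            order.append(q)
            for nxt in children[q]:
                indeg[nxt] -= 1
                if indeg[nxt] == 0:
                    ready.append(nxt)
        return order


pg = PlanningGraph()
pg.add("Which metric does the table report?")
pg.add("What value does the figure show for that metric?",
       deps=["Which metric does the table report?"])
first = pg.next_step()
pg.record(first, "The table reports average accuracy.")
second = pg.next_step()
```

Because the DAG encodes prerequisite order, the dependent sub-question only becomes the next step after its prerequisite has recorded evidence, which is one way to keep a persistent global search state across retrieval iterations.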


Source: arXiv:2601.22055v1 (http://arxiv.org/abs/2601.22055v1)
PDF: https://arxiv.org/pdf/2601.22055v1

Submission: 1/29/2026
Subjects: NLP; Computational Linguistics
