
Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

Muhammad Kamran Janjua

Abstract

Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs; it is how tool outputs are represented. We introduce Perception Programs (P²), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in BLINK, P² consistently yields large improvements over base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P² raises accuracy from 41.35% to 86.47% on multi-view reasoning and from 52.42% to 81.45% on relative depth, and achieves a 22% average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs, e.g., InternVL3.5-4B and Qwen3VL-4B, we observe 15-40% absolute gains from P², surpassing prior agentic, supervised, and RL-based tool-use methods, without any training or model modifications.

Submitted: April 16, 2026
Subjects: Machine Learning; Data Science
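The abstract does not include an implementation, but the core idea, rewriting dense tool outputs into compact, language-native cues instead of showing the model raw pixels, can be illustrated with a minimal sketch. The snippet below is not the paper's actual Perception Programs pipeline: the helper name `depth_cues_to_text`, the convention that larger depth values mean farther away, and the output format are all assumptions made for illustration. It reduces a depth map from a monocular depth tool, plus a few labeled query points, to a short symbolic summary an MLLM could parse directly.

```python
import numpy as np

def depth_cues_to_text(depth_map: np.ndarray, points: dict[str, tuple[int, int]]) -> str:
    """Summarize a dense depth map as a compact, language-native cue string.

    Assumptions: `depth_map` is an HxW array from any monocular depth tool
    (larger values = farther away); `points` maps marker labels (e.g. "A", "B")
    to (row, col) pixel coordinates of the query points.
    """
    lines = ["Depth cues (relative, from a monocular depth tool):"]
    samples = {}
    for label, (r, c) in points.items():
        # Median over a small window around each point, to be robust to single-pixel noise.
        window = depth_map[max(r - 2, 0): r + 3, max(c - 2, 0): c + 3]
        samples[label] = float(np.median(window))
        lines.append(f"- point {label}: median depth {samples[label]:.2f}")
    # Make the ordering explicit so the LLM reasons over symbols, not pixels.
    order = sorted(samples, key=samples.get)
    lines.append("Ordering (closest to farthest): " + " < ".join(order))
    return "\n".join(lines)

# Toy usage: two query points on a synthetic depth map.
toy_depth = np.linspace(0.0, 10.0, 100 * 100).reshape(100, 100)
print(depth_cues_to_text(toy_depth, {"A": (10, 10), "B": (80, 80)}))
```

In a raw-pixel baseline the model would receive the depth map as an image; here it receives only a few lines of structured text, which is the kind of representation shift the abstract argues for.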



Source: arXiv:2604.12896v1 - http://arxiv.org/abs/2604.12896v1
PDF: https://arxiv.org/pdf/2604.12896v1

