Research Paper | Researchia:202607.03007

PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation

Haofei Xu

Submitted: July 3, 2026
Subjects: Computer Vision

Description / Details

State-of-the-art single-image 3D reconstruction methods often rely on complex hybrid architectures and loss functions, or compress geometry into latent spaces in order to leverage pre-trained latent diffusion models. In this work, we show that such architectural overhead and intricate loss formulations are unnecessary. We introduce a minimalist pixel-space Diffusion Transformer, built on a plain ViT, that operates directly on raw 3D point map patches and is conditioned on image tokens from a pre-trained DINOv3. Unlike existing latent diffusion approaches, we train our diffusion backbone entirely from scratch, eliminating the need for point map tokenizers. Despite its simplicity, our approach surpasses complex latent-based diffusion models while remaining significantly simpler than hybrid alternatives. Notably, it produces sharper geometric structure and is more robust in highly ambiguous regions, such as transparent objects.
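To make the phrase "operates directly on raw 3D point map patches" concrete, the following is a minimal sketch of how a point map could be split into ViT-style patch tokens and reassembled, with no learned tokenizer in between. It is not taken from the paper: the function names `patchify` and `unpatchify`, the patch size of 16, and the (H, W, 3) array layout are all assumptions chosen for illustration.

```python
import numpy as np


def patchify(point_map, patch=16):
    """Split an (H, W, C) point map into (N, patch*patch*C) tokens.

    Hypothetical helper: the patch size and memory layout are
    illustrative assumptions, not values from the paper.
    """
    h, w, c = point_map.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C)
    x = point_map.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, in row-major patch order.
    return x.reshape(gh * gw, patch * patch * c)


def unpatchify(tokens, h, w, patch=16):
    """Inverse of patchify: rebuild the (H, W, C) point map from tokens."""
    gh, gw = h // patch, w // patch
    c = tokens.shape[1] // (patch * patch)
    # (gh, gw, patch, patch, C) -> (gh, patch, gw, patch, C)
    x = tokens.reshape(gh, gw, patch, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(h, w, c)
```

Under these assumptions, a 32 x 48 point map with XYZ channels yields 6 tokens of length 768, and `unpatchify` recovers the original array exactly. That lossless round trip is the sense in which a pixel-space model needs no separate point map tokenizer.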


Source: arXiv:2607.02515v1 (http://arxiv.org/abs/2607.02515v1)
PDF: https://arxiv.org/pdf/2607.02515v1
