Research Paper | Researchia:202606.15008

RATS! Patches Talk Through Registers: Emergent Parts in Register Attention Transformers

Timing Yang

Submitted: June 15, 2026
Subjects: Computer Vision

Description / Details

When humans see a bird, they recognize far more than just "bird" -- they see a head, wings, and talons, a structured assembly of reusable parts that can be identified across every bird they have ever seen. We ask whether a self-supervised visual model can discover the same compositional structure on its own. To this end, we propose RATS (Register Attention Transformers), which decomposes the classification token into N learnable register tokens that route patch information through an L->N->N->L bottleneck via a three-step compress-communicate-broadcast attention. The N registers are partitioned across the H attention heads, so that registers assigned to different heads do not interact with each other. Without auxiliary losses or part annotations, each register spontaneously specializes into a proto-semantic region whose emerging structure resembles object parts. RATS surpasses all baselines by +12 mIoU on average across five segmentation benchmarks, with consistent gains on ADE20K (+1.11 mIoU) and COCO (+0.2 AP^m). Its register dictionary further exhibits part-level consistency and semantic proximity across related categories. Our results suggest that RATS may provide a useful architectural prior for structured and interpretable visual representation learning.
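The three-step compress-communicate-broadcast attention described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify projections, normalization, or how registers are initialized, so the sketch assumes no learned query/key/value projections, plain scaled dot-product softmax attention at every step, L patch tokens, and registers stored directly at the per-head width d = D / H. The function name `register_attention` and all shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def register_attention(patches, registers, num_heads):
    """Hypothetical L -> N -> N -> L register bottleneck (illustrative only).

    patches:   (L, D) patch tokens.
    registers: (N, D // num_heads) register tokens; head h owns rows
               [h * N/H, (h + 1) * N/H). This layout is an assumption.
    Returns the (L, D) updated patches and the (H, L, N/H) broadcast weights.
    """
    L, D = patches.shape
    N, d = registers.shape
    assert D % num_heads == 0 and N % num_heads == 0
    assert d == D // num_heads
    n = N // num_heads
    scale = 1.0 / np.sqrt(d)

    # Split patch channels per head: (H, L, d).
    x = patches.reshape(L, num_heads, d).transpose(1, 0, 2)
    # Partition registers across heads: (H, n, d). Heads never mix below.
    r = registers.reshape(num_heads, n, d)

    # 1) Compress (L -> N): each register attends over all patches.
    a1 = softmax(r @ x.transpose(0, 2, 1) * scale)   # (H, n, L)
    r = a1 @ x                                       # (H, n, d)

    # 2) Communicate (N -> N): registers attend to registers of the same head.
    a2 = softmax(r @ r.transpose(0, 2, 1) * scale)   # (H, n, n)
    r = a2 @ r                                       # (H, n, d)

    # 3) Broadcast (N -> L): each patch attends over its head's registers.
    a3 = softmax(x @ r.transpose(0, 2, 1) * scale)   # (H, L, n)
    out = a3 @ r                                     # (H, L, d)

    # Merge heads back into (L, D).
    out = out.transpose(1, 0, 2).reshape(L, D)
    return out, a3


rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 64))    # e.g. a 14x14 patch grid, D = 64
registers = rng.normal(size=(8, 16))    # N = 8 registers, H = 4 heads, d = 16
out, broadcast = register_attention(patches, registers, num_heads=4)

# Hard assignment of each patch to its strongest register within each head,
# one simple way to visualize register-to-region specialization.
part_map = broadcast.argmax(axis=-1)    # (H, L)
```

In this sketch no L x L attention matrix is ever formed; every score matrix has at most L x N/H entries per head, which is the sense in which the registers act as a bottleneck. Because the registers are partitioned per head, changing the registers of one head leaves the output channels of the other heads unchanged.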


Source: arXiv:2606.14701v1 - http://arxiv.org/abs/2606.14701v1
PDF: https://arxiv.org/pdf/2606.14701v1
