
On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression

Xinwei Zhang

Abstract

Visual token compression is widely used to accelerate large vision-language models (LVLMs) by pruning or merging visual tokens, yet the adversarial robustness of compressed models remains unexplored. We show that existing encoder-based attacks can substantially overestimate the robustness of compressed LVLMs due to an optimization-inference mismatch: perturbations are optimized on the full-token representation, while inference is performed through a token-compression bottleneck. To address this gap, we propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compression inference without assuming access to the deployed compression mechanism or its token budget. CAGE combines (i) expected feature disruption, which concentrates distortion on tokens likely to survive across plausible token budgets, and (ii) rank distortion alignment, which actively aligns token distortions with rank scores so that highly distorted evidence is retained after compression. Across diverse representative plug-and-play compression mechanisms and datasets, CAGE consistently drives robust accuracy lower than baseline attacks do. This work highlights that robustness assessments ignoring compression can be overly optimistic, calling for compression-aware security evaluation and defenses for efficient LVLMs.
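
The two ingredients can be made concrete with a short sketch. The PyTorch fragment below is a minimal illustration under stated assumptions, not the paper's actual objective: the surrogate rank scores, the guessed set of plausible token budgets, the cosine-similarity form of the alignment term, and all names (survival_probs, cage_style_loss, lam) are hypothetical.

```python
# Illustrative sketch of a compression-aligned attack objective.
# Assumptions (not from the paper): rank_scores come from a surrogate
# importance measure, budgets is a guessed set of plausible token budgets,
# and the alignment term is a cosine similarity. The loss is MAXIMIZED over
# the image perturbation, e.g., with PGD under an L-infinity constraint.
import torch
import torch.nn.functional as F

def survival_probs(rank_scores, budgets):
    """Per-token probability of surviving pruning, averaged over budgets.

    rank_scores: (N,) tensor; higher = more likely kept by the compressor.
    budgets:     iterable of token budgets k (number of tokens kept).
    """
    order = rank_scores.argsort(descending=True)      # token indices, best first
    ranks = torch.empty_like(order)
    ranks[order] = torch.arange(order.numel(), device=order.device)
    keep = torch.stack([(ranks < k).float() for k in budgets])  # (B, N)
    return keep.mean(dim=0)                           # (N,) values in [0, 1]

def cage_style_loss(f_clean, f_adv, rank_scores, budgets, lam=1.0):
    """Hypothetical objective to maximize over the perturbation.

    f_clean, f_adv: (N, D) visual token features before/after perturbing.
    """
    dist = (f_adv - f_clean).norm(dim=-1)             # (N,) per-token distortion
    # (i) expected feature disruption: weight each token's distortion by its
    #     chance of surviving compression across the plausible budgets.
    #     survival_probs is built from argsort, so it carries no gradient
    #     and acts purely as a weighting.
    efd = (survival_probs(rank_scores, budgets) * dist).mean()
    # (ii) rank distortion alignment: push the distortion profile to line up
    #      with the rank scores, so heavily distorted tokens are also the
    #      ones the compressor prefers to retain.
    rda = F.cosine_similarity(dist.unsqueeze(0), rank_scores.unsqueeze(0)).squeeze()
    return efd + lam * rda

if __name__ == "__main__":
    N, D = 576, 1024                       # e.g., a 24x24 ViT token grid
    f_clean = torch.randn(N, D)
    f_adv = (f_clean + 0.1 * torch.randn(N, D)).requires_grad_(True)
    scores = torch.rand(N)                 # stand-in surrogate rank scores
    loss = cage_style_loss(f_clean, f_adv, scores, budgets=[64, 128, 192])
    loss.backward()                        # gradients reach the perturbed features
    print(float(loss))
```

In a full attack loop, the gradient would be propagated through the vision encoder back to the input image, with rank scores taken from a surrogate importance measure (for example, attention received from a [CLS] or text query token), since CAGE assumes no access to the deployed compressor or its budget.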


Source: arXiv:2601.21531v1 (http://arxiv.org/abs/2601.21531v1)
PDF: https://arxiv.org/pdf/2601.21531v1

Submission: January 29, 2026
Subjects: Cybersecurity; Cryptography