Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders
Abstract
Sparse autoencoders (SAEs) have become a leading tool for interpreting the representations of vision foundation models, decomposing their polysemantic activations into a larger set of sparse, more monosemantic features. The Top-$k$ SAE, a now-standard variant, enforces sparsity architecturally through its activation function, retaining only the $k$ most active latents per input. Because it was designed precisely to avoid the $\ell_1$ penalty used by earlier SAEs and its known drawbacks, it has n...
Description / Details
Sparse autoencoders (SAEs) have become a leading tool for interpreting the representations of vision foundation models, decomposing their polysemantic activations into a larger set of sparse, more monosemantic features. The Top-$k$ SAE, a now-standard variant, enforces sparsity architecturally through its activation function, retaining only the $k$ most active latents per input. Because it was designed precisely to avoid the $\ell_1$ penalty used by earlier SAEs and its known drawbacks, it has not been combined with an explicit sparsity regularizer, despite retaining limitations of its own, such as a budget that is fixed regardless of input complexity and a tendency to overfit to the training value of $k$. We introduce two sparsity regularizers compatible with the Top-$k$ architecture, both acting on the activations before the Top-$k$ selection: a penalty on the unselected (off-support) units, and a scale-invariant ratio penalty that concentrates the code onto fewer effective units. Both penalties are applied only to the batch-active units, those selected by the Top-$k$ operator at least once within the batch. Across two datasets, three vision foundation models, and a range of $k$, both regularizers consistently improve monosemanticity at no cost to reconstruction quality. The ratio penalty further concentrates information into fewer latents, making reconstruction more robust to the inference-time choice of $k$ and improving small-budget linear probing. Our central finding is that hard architectural sparsity and soft sparsity regularization are complementary rather than mutually exclusive.
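The following is a minimal NumPy sketch of how two penalties of this kind might be computed on the pre-selection activations. It is an illustration under stated assumptions, not the authors' implementation. The function names `topk_mask` and `sparsity_penalties` are hypothetical. The norm choices are also assumptions: the listing does not preserve the norm symbols, so the sketch takes an $\ell_1$ norm for the off-support penalty and an $\ell_1/\ell_2$ ratio for the scale-invariant penalty. Averaging over the batch and the small `eps` constant are further assumptions.

```python
import numpy as np

def topk_mask(pre, k):
    """Boolean mask marking the k largest pre-activations in each row."""
    # argpartition places the indices of the k largest entries in the last k slots
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    mask = np.zeros(pre.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask

def sparsity_penalties(pre, k, eps=1e-8):
    """Return (off_support, ratio) penalties for a batch of pre-activations.

    pre: array of shape (batch, n_latents), activations BEFORE Top-k selection.
    Both penalties are restricted to batch-active units, i.e. latents that
    Top-k selects for at least one input in the batch.
    """
    mask = topk_mask(pre, k)              # per-input support chosen by Top-k
    batch_active = mask.any(axis=0)       # latents selected at least once

    # Assumed l1 penalty on unselected (off-support) units that are batch-active.
    off = (~mask) & batch_active[None, :]
    off_support = np.abs(pre * off).sum(axis=1).mean()

    # Assumed l1/l2 ratio on the batch-active units. Rescaling `pre` by a
    # positive constant leaves this value unchanged (up to eps).
    z = pre[:, batch_active]
    l1 = np.abs(z).sum(axis=1)
    l2 = np.linalg.norm(z, axis=1)
    ratio = (l1 / (l2 + eps)).mean()
    return off_support, ratio

# Usage: two inputs, four latents, budget k = 1.
pre = np.array([[3.0, 1.0, 0.5, -2.0],
                [0.2, 4.0, 0.1, -1.0]])
off_support, ratio = sparsity_penalties(pre, k=1)
```

In training, such terms would presumably be added to the reconstruction loss with small weights; the abstract does not state those weights, so none are shown here.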
Source: arXiv:2606.27321v1 - http://arxiv.org/abs/2606.27321v1 PDF: https://arxiv.org/pdf/2606.27321v1
Jun 26, 2026
Artificial Intelligence