ExplorerData ScienceMachine Learning
Research PaperResearchia:202604.01062

Tracking Equivalent Mechanistic Interpretations Across Neural Networks

Alan Sun

Abstract

Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and model, MI aims to discover a succinct algorithmic process, an interpretation, that explains the model's decision process on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: there is no precise notion of a valid interpretation; and, generating interpretations is often an ad hoc process. In this paper, we address these challenges by d...

Submitted: April 1, 2026Subjects: Machine Learning; Data Science

Description / Details

Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and model, MI aims to discover a succinct algorithmic process, an interpretation, that explains the model's decision process on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: there is no precise notion of a valid interpretation; and, generating interpretations is often an ad hoc process. In this paper, we address these challenges by defining and studying the problem of interpretive equivalence: determining whether two different models share a common interpretation, without requiring an explicit description of what that interpretation is. At the core of our approach, we propose and formalize the principle that two interpretations of a model are equivalent if all of their possible implementations are also equivalent. We develop an algorithm to estimate interpretive equivalence and case study its use on Transformer-based models. To analyze our algorithm, we introduce necessary and sufficient conditions for interpretive equivalence based on models' representation similarity. We provide guarantees that simultaneously relate a model's algorithmic interpretations, circuits, and representations. Our framework lays a foundation for the development of more rigorous evaluation methods of MI and automated, generalizable interpretation discovery methods.


Source: arXiv:2603.30002v1 - http://arxiv.org/abs/2603.30002v1 PDF: https://arxiv.org/pdf/2603.30002v1 Original Link: http://arxiv.org/abs/2603.30002v1

Please sign in to join the discussion.

No comments yet. Be the first to share your thoughts!

Access Paper
View Source PDF
Submission Info
Date:
Apr 1, 2026
Topic:
Data Science
Area:
Machine Learning
Comments:
0
Bookmark
Tracking Equivalent Mechanistic Interpretations Across Neural Networks | Researchia