Research Paper
Researchia:202606.23040

What Does a Chemical Language Model Know About Molecules?

Christian Kenneth


Submitted: June 23, 2026
Subjects: Chemistry

Description / Details

Chemical language models (cLMs) are widely assumed to learn surface-level syntactic patterns rather than meaningful molecular semantics. Here, we apply sparse autoencoders (SAEs) to MolFormer, an encoder-only cLM, to examine mechanistically how molecular representations are built across layers. We find that early layers rely on position-tracking latents to parse molecular grammar, while later layers encode atom-in-substructure and pharmacologically relevant features. We also show that non-canonical SMILES produce more disruptive representation shifts than invalid SMILES, an effect driven by position-latent disruption that propagates across layers. To support further exploration, we develop InterMol, an interactive visualizer for SAE activations on molecular strings and structures.
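
For readers unfamiliar with sparse autoencoders, the following is a minimal sketch of the general technique, not the paper's implementation. It assumes a common formulation (ReLU encoder, linear decoder, reconstruction loss plus an L1 sparsity penalty). All names, dimensions, and the random toy vector standing in for a model activation are hypothetical; the abstract does not specify the SAE variant or hyperparameters used with MolFormer.

```python
import random

def relu(v):
    # Element-wise ReLU keeps latent activations non-negative.
    return [max(0.0, a) for a in v]

def matvec(W, v):
    # W is a list of rows; returns the matrix-vector product W @ v.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (latents, reconstruction) for one activation vector x."""
    # Subtract the decoder bias before encoding (one common convention).
    centered = [a - b for a, b in zip(x, b_dec)]
    pre = [h + b for h, b in zip(matvec(W_enc, centered), b_enc)]
    latents = relu(pre)
    recon = [r + b for r, b in zip(matvec(W_dec, latents), b_dec)]
    return latents, recon

def sae_loss(x, latents, recon, l1_coeff=1e-3):
    # Mean squared reconstruction error plus an L1 sparsity penalty.
    mse = sum((a - r) ** 2 for a, r in zip(x, recon)) / len(x)
    sparsity = sum(latents)  # latents are non-negative, so this is the L1 norm
    return mse + l1_coeff * sparsity

# Toy setup: a 4-dimensional "activation" expanded into 16 latents.
# Real cLM activations would be far wider; these sizes are illustrative only.
rng = random.Random(0)
d_model, d_latent = 4, 16
W_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_latent)]
W_dec = [[rng.gauss(0, 0.5) for _ in range(d_latent)] for _ in range(d_model)]
b_enc = [0.0] * d_latent
b_dec = [0.0] * d_model

x = [rng.gauss(0, 1) for _ in range(d_model)]  # stand-in for one token's activation
latents, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, latents, recon)
```

In an interpretability workflow, the weights would be trained to minimize this loss over many activations, and individual entries of `latents` would then be inspected to see which tokens or substructures activate them.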


Source: arXiv:2606.23443v1 (http://arxiv.org/abs/2606.23443v1)
PDF: https://arxiv.org/pdf/2606.23443v1
