Research Paper | Researchia:202602.02086 [Biotechnology > Biochemistry]

A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method

Feiyang Cai

Abstract

Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework for generating precise molecular structure descriptions at scale. Our approach builds upon and extends a rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched, structured XML metadata that explicitly encodes molecular structure. This metadata is then used to guide LLMs in producing accurate natural-language descriptions. Using this framework, we curate a large-scale dataset of approximately 163k molecule-description pairs. A rigorous validation protocol combining LLM-based and expert human evaluation on a subset of 2,000 molecules demonstrates a high description precision of 98.6%. The resulting dataset provides a reliable foundation for future molecule-language alignment, and the proposed annotation method is readily extensible to larger datasets and broader chemical tasks that rely on structural descriptions.
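
The pipeline the abstract outlines (name, then structured XML metadata, then an LLM prompt) might look roughly like the minimal Python sketch below. Everything in it is an illustrative assumption rather than the paper's implementation: the XML schema, the helper names `build_metadata` and `build_prompt`, the prompt wording, and the hand-written parse of the example name, which stands in for the output of a real rule-based nomenclature parser.

```python
import xml.etree.ElementTree as ET


def build_metadata(iupac_name, parent, substituents):
    """Encode a pre-parsed name as XML (hypothetical schema, not the paper's)."""
    root = ET.Element("molecule", {"name": iupac_name})
    ET.SubElement(root, "parent").text = parent
    for locant, group in substituents:
        # One element per substituent, with its position on the parent chain.
        ET.SubElement(root, "substituent", {"locant": str(locant)}).text = group
    return ET.tostring(root, encoding="unicode")


def build_prompt(metadata_xml):
    """Wrap the metadata in an LLM instruction (wording is an assumption)."""
    return (
        "Describe the structure of the molecule using only the facts "
        "in this metadata:\n" + metadata_xml
    )


# Hand-written, simplified parse of "2-chloroethanol"; a real parser would
# derive these fields from the name automatically.
xml_text = build_metadata("2-chloroethanol", "ethanol", [(2, "chloro")])
prompt = build_prompt(xml_text)
print(prompt)
```

The point of the sketch is the division of labor: the structural facts come from deterministic rules, and the LLM is asked only to verbalize them.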


Source: arXiv:2602.02320v1 (http://arxiv.org/abs/2602.02320v1)
PDF: https://arxiv.org/pdf/2602.02320v1

Submission: 2/2/2026
Subjects: Biochemistry; Biotechnology