Research Paper
Researchia:202606.30019

DNA Language Models: An Assessment of Pre-Training for Fine-Tuning Tasks

Romain Karpinsky

Submitted: June 30, 2026
Subjects: Biology; Biotechnology

Description / Details

Recent breakthroughs in foundation models and Large Language Models (LLMs) have introduced new opportunities for studying and decoding genomic sequences. Several state-of-the-art approaches, such as DNABERT2, rely on transformer-based architectures, while others, such as ConvNova, still build upon more conventional convolutional models. However, systematic benchmark comparisons across these methods remain scarce. Given that transformer-based models require extensive and costly pretraining, it is crucial to evaluate whether their performance gains justify this overhead. Moreover, LLMs such as DNABERT2 typically rely on Byte Pair Encoding (BPE) tokenization, whose relevance for DNA sequence representation is still debated within the genomics community. In this work, we investigate three key questions: (i) do heavily pretrained transformer-based models improve enough on fine-tuning tasks to justify the cost of pretraining, (ii) what is the actual contribution of pretraining in this setting, and (iii) how does BPE tokenization impact performance on genomics-related tasks?
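
To make the tokenization question concrete, the following is a minimal, illustrative sketch of how BPE merges could be learned on DNA strings, next to a fixed-length k-mer split for contrast. It is not the tokenizer used by DNABERT2 or by the paper: the function names (`learn_bpe_merges`, `apply_merge`, `bpe_tokenize`, `kmer_tokenize`) and the toy sequences are assumptions made for illustration only.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each left-to-right occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` BPE merges, starting from single nucleotides.

    Toy version (an assumption, not a production tokenizer): adjacent pairs
    are counted with overlap, and ties are broken lexicographically.
    """
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break  # every sequence has collapsed to a single token
        best_pair, _ = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        merges.append(best_pair)
        corpus = [apply_merge(tokens, best_pair) for tokens in corpus]
    return merges


def bpe_tokenize(sequence, merges):
    """Tokenize a sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


def kmer_tokenize(sequence, k):
    """Non-overlapping k-mer split, shown only as a fixed-length contrast."""
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]


if __name__ == "__main__":
    toy_corpus = ["TATATATA", "TATAGCGC", "GCGCTATA"]  # made-up sequences
    merges = learn_bpe_merges(toy_corpus, num_merges=3)
    print(merges)
    print(bpe_tokenize("TATAGCGCTA", merges))
    print(kmer_tokenize("TATAGCGCTA", 3))
```

In this sketch, BPE tokens have variable length and depend on pair frequencies in the training corpus, whereas the k-mer split is fixed regardless of content. That difference is what makes the suitability of BPE for DNA an empirical question.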


Source: arXiv:2606.30140v1 - http://arxiv.org/abs/2606.30140v1
PDF: https://arxiv.org/pdf/2606.30140v1
