Research Paper | Researchia:202606.05022

$p$-adic Bi-Filtrations for Topological Machine Learning on Genomic Sequences

Tirtharaj Dash

Submitted: June 5, 2026
Subjects: Biology; Biotechnology

Description / Details

We introduce pVR, a topological machine learning framework for alignment-free genomic sequence classification that combines $p$-adic numbers with topological data analysis. Each DNA sequence is encoded along two complementary axes: a $p$-adic distance on $k$-mer prefixes, which captures hierarchical positional structure, and a compositional $L_1$ distance on $k$-mer frequencies, which captures local sequence content. The two distances jointly parameterise a bi-filtered Vietoris--Rips complex, and per-sequence topological summaries from this bi-filtration serve as features for standard machine learning classifiers. We establish theoretical guarantees for the construction: stability under metric perturbations and invariance to the choice of prime, alongside a result that explains why a single $p$-adic axis is topologically uninformative and why the bi-filtration recovers nontrivial homology. On twelve genomic benchmarks ($28$ to $500$ sequences, $3$ to $7$ classes), pVR outperforms four established alignment-free baselines on three of six low-sample datasets, with gains of up to $21$ percentage points; it underperforms only on a SARS-CoV-2 variant benchmark whose point-mutation divergence violates the hierarchical assumption, and all methods saturate in the large-sample regime. pVR also outperforms zero-shot frozen embeddings from the 500M-parameter Nucleotide Transformer v2 by $6.7$ to $11.4$ percentage points on three low-sample benchmarks. The pVR codebase is publicly available at https://github.com/MAHI-Group/pVR.
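The two-distance idea can be illustrated with a small sketch. This is not the pVR implementation: the abstract does not give the exact definitions, so the code below assumes a simple prefix ultrametric $p^{-\ell}$ (with $\ell$ the length of the longest common prefix) as a stand-in for the $p$-adic distance, and an $L_1$ distance between normalised $k$-mer frequency vectors. The function names (`prefix_distance`, `l1_distance`, `bifiltration_edges`) and the thresholds `r` and `s` are hypothetical. Under these assumptions, an edge of the bi-filtered Vietoris--Rips complex is present at parameters $(r, s)$ when both distances fall below their respective thresholds.

```python
from collections import Counter
from itertools import combinations


def kmer_frequencies(seq, k):
    """Relative frequencies of the overlapping k-mers in seq."""
    n = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n))
    return {kmer: c / n for kmer, c in counts.items()}


def l1_distance(seq_a, seq_b, k=2):
    """L1 distance between k-mer frequency vectors (compositional axis)."""
    fa, fb = kmer_frequencies(seq_a, k), kmer_frequencies(seq_b, k)
    return sum(abs(fa.get(m, 0.0) - fb.get(m, 0.0)) for m in set(fa) | set(fb))


def prefix_distance(seq_a, seq_b, p=2):
    """Assumed stand-in for the p-adic axis: p**(-l), where l is the
    length of the longest common prefix; 0 for identical sequences."""
    if seq_a == seq_b:
        return 0.0
    l = 0
    for a, b in zip(seq_a, seq_b):
        if a != b:
            break
        l += 1
    return float(p) ** (-l)


def bifiltration_edges(seqs, r, s, k=2, p=2):
    """Edges (i, j) present at bi-filtration parameters (r, s):
    both the prefix distance and the L1 distance must be small enough."""
    return [
        (i, j)
        for i, j in combinations(range(len(seqs)), 2)
        if prefix_distance(seqs[i], seqs[j], p) <= r
        and l1_distance(seqs[i], seqs[j], k) <= s
    ]


seqs = ["ACGTAC", "ACGTTT", "TTGTAC"]
print(prefix_distance(seqs[0], seqs[1]))      # shared prefix "ACGT"
print(l1_distance(seqs[0], seqs[1]))          # 2-mer composition gap
print(bifiltration_edges(seqs, r=0.1, s=1.0))
```

In this toy version, changing `p` rescales the prefix distance monotonically, so the order in which edges enter along that axis does not change. The prefix distance also satisfies the strong triangle inequality $d(x, z) \le \max(d(x, y), d(y, z))$, which is the ultrametric property that $p$-adic distances share.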


Source: arXiv:2606.06117v1 (http://arxiv.org/abs/2606.06117v1)
PDF: https://arxiv.org/pdf/2606.06117v1
