Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction
Abstract
Microbiome functions are encoded in the genes of the community-wide metagenome. A natural question is whether properties of a microbial community can be predicted from the raw DNA sequences of its members alone. In this work, we use set-aggregated genome embeddings (SAGE) to predict community-level abundance profiles, exploiting the few-shot learning capabilities of genomic language models (GLMs). Benchmarks show that this approach generalizes better to novel genomes than classical bioinformatics approaches. Model ablation shows that community-level latent representations directly improve performance. Lastly, we demonstrate the benefits of intermediate transformations between latent representations and characterize the differences between GLM embedding choices.
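To make the idea of set aggregation concrete, the following is a minimal sketch and not the architecture from the paper, which the abstract does not specify. It assumes each genome has already been mapped to a fixed-length vector (random numbers stand in for GLM embeddings here), pools those vectors with a mean to form a community-level representation, and combines that representation with each genome's own embedding to produce relative abundances. The function name `predict_abundances`, the weights `W_g`, `W_c`, `v`, and all dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


def predict_abundances(embeddings, W_g, W_c, v):
    """Toy set-aggregated predictor (illustrative assumption, not the paper's model).

    embeddings: (n_genomes, d) array, a stand-in for per-genome GLM embeddings.
    W_g, W_c:   (d, h) hypothetical weight matrices for genome and community inputs.
    v:          (h,) hypothetical output weights.
    Returns one relative abundance per genome, summing to 1.
    """
    # Community-level representation: mean pooling is invariant to genome order.
    community = embeddings.mean(axis=0)                      # shape (d,)
    # Each genome is scored from its own embedding plus the shared community context.
    hidden = np.tanh(embeddings @ W_g + community @ W_c)     # shape (n_genomes, h)
    scores = hidden @ v                                      # shape (n_genomes,)
    # Numerically stable softmax converts scores into relative abundances.
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()


# Random placeholders instead of real embeddings or trained weights.
rng = np.random.default_rng(0)
n_genomes, d, h = 5, 8, 4
embeddings = rng.normal(size=(n_genomes, d))
W_g = rng.normal(size=(d, h))
W_c = rng.normal(size=(d, h))
v = rng.normal(size=h)

abund = predict_abundances(embeddings, W_g, W_c, v)
print(abund.shape, float(abund.sum()))
```

Because the community vector is a mean over the set, reordering the genomes only reorders the predicted abundances, and the same function accepts communities of any size. The nonlinearity matters in this sketch: with a purely additive community term, the softmax would cancel the shared context.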
Source: arXiv:2605.12286v1 - http://arxiv.org/abs/2605.12286v1
PDF: https://arxiv.org/pdf/2605.12286v1
May 13, 2026
Biotechnology
Biology