Research Paper (Researchia:202606.12070)

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

Marek Šuppa

Submitted: June 12, 2026
Subjects: AI; Artificial Intelligence

Description / Details

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types, nearly 4× the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally deployable Slovak embeddings, we develop e5-sk-small (45M parameters) and e5-sk-large (365M parameters) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62%, our open-source models achieve performance competitive with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.
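The vocabulary trimming mentioned above can be pictured with a toy example. The sketch below is not the authors' implementation; it only assumes the general idea that rows of a token embedding matrix for tokens the target language does not need are dropped and the remaining token IDs are renumbered. The function names `trim_vocabulary` and `remap`, the matrix sizes, and the kept token IDs are all hypothetical.

```python
import numpy as np


def trim_vocabulary(embeddings, keep_ids):
    """Keep only the embedding rows listed in keep_ids.

    Returns the smaller matrix and a dict mapping old token IDs to new ones.
    """
    kept = sorted(set(keep_ids))
    trimmed = embeddings[kept]  # select the surviving rows
    old_to_new = {old: new for new, old in enumerate(kept)}
    return trimmed, old_to_new


def remap(token_ids, old_to_new, unk_id):
    """Translate old token IDs to new ones; dropped tokens fall back to unk_id."""
    fallback = old_to_new[unk_id]
    return [old_to_new.get(t, fallback) for t in token_ids]


# Toy data (hypothetical): a 10-token vocabulary with 4-dimensional embeddings.
rng = np.random.default_rng(0)
full = rng.normal(size=(10, 4))

# Assume token 0 is the unknown token and tokens 1, 3, 7 occur in the target language.
keep = [0, 1, 3, 7]
trimmed, mapping = trim_vocabulary(full, keep)

# Fraction of embedding parameters removed in this toy setting.
reduction = 1 - trimmed.size / full.size
print(trimmed.shape, f"{reduction:.0%}")
```

In a real multilingual model the embedding matrix holds a large share of the parameters, which is why removing unused rows can shrink the model substantially. The tokenizer must be rebuilt with the same ID mapping so that inputs still index the correct rows.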


Source: arXiv:2606.13647v1 - http://arxiv.org/abs/2606.13647v1
PDF: https://arxiv.org/pdf/2606.13647v1
