ExplorerComputational LinguisticsNLP
Research PaperResearchia:202601.29069

When "Better" Prompts Hurt: Evaluation-Driven Iteration for LLM Applications

Daniel Commey

Abstract

Evaluating Large Language Model (LLM) applications differs from traditional software testing because outputs are stochastic, high-dimensional, and sensitive to prompt and model changes. We present an evaluation-driven workflow - Define, Test, Diagnose, Fix - that turns these challenges into a repeatable engineering loop. We introduce the Minimum Viable Evaluation Suite (MVES), a tiered set of recommended evaluation components for (i) general LLM applications, (ii) retrieval-augmented generatio...

Submitted: January 29, 2026Subjects: NLP; Computational Linguistics

Description / Details

Evaluating Large Language Model (LLM) applications differs from traditional software testing because outputs are stochastic, high-dimensional, and sensitive to prompt and model changes. We present an evaluation-driven workflow - Define, Test, Diagnose, Fix - that turns these challenges into a repeatable engineering loop. We introduce the Minimum Viable Evaluation Suite (MVES), a tiered set of recommended evaluation components for (i) general LLM applications, (ii) retrieval-augmented generation (RAG), and (iii) agentic tool-use workflows. We also synthesize common evaluation methods (automated checks, human rubrics, and LLM-as-judge) and discuss known judge failure modes. In reproducible local experiments (Ollama; Llama 3 8B Instruct and Qwen 2.5 7B Instruct), we observe that a generic "improved" prompt template can trade off behaviors: on our small structured suites, extraction pass rate decreased from 100% to 90% and RAG compliance from 93.3% to 80% for Llama 3 when replacing task-specific prompts with generic rules, while instruction-following improved. These findings motivate evaluation-driven prompt iteration and careful claim calibration rather than universal prompt recipes. All test suites, harnesses, and results are included for reproducibility.


Source: arXiv:2601.22025v1 - http://arxiv.org/abs/2601.22025v1 PDF: https://arxiv.org/pdf/2601.22025v1 Original Link: http://arxiv.org/abs/2601.22025v1

Please sign in to join the discussion.

No comments yet. Be the first to share your thoughts!

Access Paper
View Source PDF
Submission Info
Date:
Jan 29, 2026
Topic:
Computational Linguistics
Area:
NLP
Comments:
0
Bookmark