Research Paper | Researchia: 202602.11048 | Data Science > Machine Learning

Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability

Aaditya Vikram Prasad

Abstract

Language models trained on large-scale datasets have been shown to learn features that encode abstract concepts such as factuality or intent. Such features are traditionally used for test-time monitoring or steering. We present an alternative affordance: features as scalable supervision for open-ended tasks. We consider hallucination reduction as a desirable yet open-ended behavior and design a reinforcement learning (RL) pipeline, RLFR (Reinforcement Learning from Feature Rewards), that uses features as reward functions. Grounded in a novel probing framework that identifies candidate hallucinated claims, our pipeline teaches a model to intervene and correct its completions when it is uncertain of their factuality. The pipeline also enables scalable test-time compute, guided once more by our reward features. This end-to-end process, operationalized on Gemma-3-12B-IT, yields a policy that is 58% less likely to hallucinate than the original model while preserving performance on standard benchmarks. Taken together, by grounding supervision in the language of features, this paper introduces a novel paradigm for using interpretability to learn open-ended tasks.
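The abstract gives no implementation details, but the core mechanism it names, a frozen probe over a model's hidden features acting as the reward signal for RL, can be sketched. The following is a minimal illustration, not the paper's RLFR pipeline: the policy is a tiny GRU language model standing in for Gemma-3-12B-IT, the `probe` direction is random rather than learned by the paper's probing framework, and the update rule is plain REINFORCE. All names and hyperparameters here are hypothetical.

```python
# Minimal "features as rewards" sketch (illustrative assumptions only,
# not the paper's RLFR implementation).
import torch

torch.manual_seed(0)
VOCAB, HIDDEN = 100, 32

# Stand-in policy: embedding + GRU + LM head. A real setup would start
# from the instruction-tuned LM (e.g. Gemma-3-12B-IT per the abstract).
embed = torch.nn.Embedding(VOCAB, HIDDEN)
rnn = torch.nn.GRU(HIDDEN, HIDDEN, batch_first=True)
head = torch.nn.Linear(HIDDEN, VOCAB)
params = list(embed.parameters()) + list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

# Frozen "factuality" direction. In the paper this would come from the
# probing framework; here it is a random unit vector stand-in.
probe = torch.randn(HIDDEN)
probe /= probe.norm()

def rollout(prompt, max_new=8):
    """Sample a completion; return summed log-probs and hidden states."""
    x = embed(prompt).unsqueeze(0)        # (1, T, HIDDEN)
    out, h = rnn(x)                       # encode the prompt
    logps, hiddens = [], []
    for _ in range(max_new):
        logits = head(out[:, -1])         # next-token distribution
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()
        logps.append(dist.log_prob(tok))
        hiddens.append(out[:, -1].squeeze(0))
        out, h = rnn(embed(tok).unsqueeze(0), h)
    return torch.stack(logps).sum(), torch.stack(hiddens)

prompt = torch.randint(0, VOCAB, (4,))
for step in range(50):
    logp, hiddens = rollout(prompt)
    # Feature reward: mean projection of the completion's hidden states
    # onto the frozen probe direction, detached so the reward is treated
    # as a scalar and gradients flow only through the log-probabilities.
    reward = (hiddens.detach() @ probe).mean()
    loss = -reward * logp                 # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The detach on the hidden states is the key design choice: the probe scores the policy's own activations, but the policy is updated only through the policy-gradient term, so the reward behaves like any external scalar signal.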


Source: arXiv:2602.10067v1 (http://arxiv.org/abs/2602.10067v1)
PDF: https://arxiv.org/pdf/2602.10067v1

Submission: 2/11/2026
Subjects: Machine Learning; Data Science