

Self‑Rewarding Language Models
This paper introduces Self-Rewarding Language Models, in which a large language model iteratively generates, evaluates, and learns from its own outputs rather than relying on an external, frozen reward model, establishing a self-improving approach to alignment and performance.

CartaNova
Jul 7, 2025
Authors: Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston (Meta & NYU). Paper: arXiv:2401.10020.
Core Idea
Instead of relying on a fixed, separately trained reward model (as in traditional RLHF or standard DPO pipelines), this approach has the LLM judge its own outputs via LLM-as-a-Judge prompting and assign itself rewards inside an iterative training loop. The model effectively becomes both the actor and the critic, improving through repeated cycles of self-assessment and preference training.
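To make the self-judging step concrete, here is a minimal Python sketch of LLM-as-a-Judge style scoring. It assumes a hypothetical `model_generate` callable that returns the model's judgement text, and the rubric is paraphrased in the spirit of the paper's additive 0-5 judging prompt, not quoted verbatim.

```python
import re

# Additive 5-point rubric, paraphrased from the paper's LLM-as-a-Judge prompt
# (not the verbatim prompt text).
JUDGE_TEMPLATE = """Review the user's question and the candidate response below.
Award points additively, up to 5 in total: relevance (+1), substantial coverage (+1),
a useful answer to the core question (+1), a clear AI-assistant perspective (+1),
an expert-quality, engaging answer (+1).

User: {prompt}
Response: {response}

After a brief justification, end with the line "Score: <total points>"."""


def self_reward(model_generate, prompt: str, response: str) -> float | None:
    """Ask the model to grade its own response; returns a 0-5 score,
    or None if the judgement contains no parseable score.

    `model_generate` is a hypothetical callable (str -> str) standing in
    for whatever inference stack produces the judgement text.
    """
    judgement = model_generate(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", judgement)
    return float(match.group(1)) if match else None
```

In the paper, several sampled judgements per response are averaged to reduce variance in the reward.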
Workflow
Initialization: Start from a seed model fine-tuned on human-written instruction-following data (IFT) and, optionally, on evaluation examples (EFT) that demonstrate how to score responses in the LLM-as-a-Judge format.
Self-Instruction Creation: The model generates new prompts (few-shot prompted from the seed data), samples several candidate responses for each, and then scores those responses with its own LLM-as-a-Judge prompt to build a preference dataset.
Preference-based Training: Using Direct Preference Optimization (DPO), the model is trained on these self-judged preference pairs. Repeating the cycle improves both response quality and the model's ability to judge, i.e., its implicit reward model.
This iterative cycle allows the model to continually refine both its output quality and its own reward model; a condensed sketch of one iteration follows below.
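The sketch below condenses one self-rewarding iteration under stated assumptions: it reuses the `self_reward` function above, introduces a hypothetical `sample_responses` helper for candidate generation, and expects summed sequence log-probabilities from the policy and a frozen reference copy of the model. The loss is the standard DPO formulation, not code released with the paper.

```python
import torch.nn.functional as F


def build_preference_pairs(model_generate, prompts, n_samples=4):
    """Self-instruction creation: sample candidate answers, self-judge them,
    and keep the best/worst pair per prompt.

    `sample_responses` is a hypothetical helper returning n candidate
    completions; `self_reward` is the judging sketch above.
    """
    pairs = []
    for prompt in prompts:
        candidates = sample_responses(model_generate, prompt, n=n_samples)
        scored = [(self_reward(model_generate, prompt, c), c) for c in candidates]
        scored = [(s, c) for s, c in scored if s is not None]
        if len(scored) < 2:
            continue
        best, worst = max(scored), min(scored)
        if best[0] > worst[0]:  # drop prompts where all candidates tie
            pairs.append({"prompt": prompt, "chosen": best[1], "rejected": worst[1]})
    return pairs


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on sequence log-probs, comparing the policy
    being trained against a frozen reference model."""
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    return -F.logsigmoid(logits).mean()
```

Each round trains a new model on preference pairs that the previous round's model both generated and judged, which is why response quality and judging ability can improve together across iterations.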
Results
Fine-tuning Llama 2 70B through three iterations of self-rewarding training yielded a model that outperforms Claude 2, Gemini Pro, and GPT-4 0613 on the AlpacaEval 2.0 leaderboard.
Suggests that a model's performance need not be capped by a fixed, human-labeled reward signal, since the reward model itself improves during training.
Significance
Introduces a self-improving feedback loop that reduces dependency on expensive human annotations.
Points toward a possible route to superhuman feedback, and eventually superhuman agents, by enabling the model to improve its own reward mechanism alongside its response quality.