Summary
Large Language Models (LLMs) have rapidly advanced and now serve as the backbone of numerous language-based applications, from open-ended question answering to complex reasoning and code generation. Central to the development and maintenance of high-quality LLMs is the ability to evaluate their outputs effectively. Such evaluation models—often called reward models or evaluators—are crucial at multiple stages: they are used during training as a reward function to align models with human preferences, as well as at inference time to assess and compare generated responses. Typically, these evaluators rely heavily on large quantities of human-labeled preference data, which is costly and slow to produce and may become outdated as newer, stronger models emerge.
![Synthetic LLM as a judge data criterion](https://kingy.ai/wp-content/uploads/2024/12/syntheticLLMasJudge-1024x364.jpeg)
The paper proposes a novel approach to train such evaluators without any human-labeled preference data. Dubbed the “Self-Taught Evaluator,” this approach uses synthetic training data exclusively, coupled with an iterative refinement procedure. Beginning with an initial seed LLM, the method generates synthetic preference pairs for a large collection of unlabeled user instructions and then uses the model itself to produce judgments of these pairs. These judgments, after a filtering step that ensures correctness, serve as training data to improve the evaluator model. By repeating this process, the evaluator “teaches itself” to judge response quality more effectively. The approach leads to substantial performance gains and, remarkably, matches or outperforms many evaluators trained on large human-labeled datasets. It even surpasses commonly used reference models such as GPT-4 in certain evaluations.
Motivation and Background
The core motivation behind this work arises from the fundamental role evaluation models play in LLM development. Evaluators are needed throughout the LLM lifecycle. For example, in Reinforcement Learning from Human Feedback (RLHF), human judgments are used to guide the model toward preferred responses. More recently, researchers have begun to rely on model-based evaluators to speed up this process and potentially reduce the human annotation load. However, one key limitation remains: most current evaluator or reward models depend on costly human-labeled preference data, and such labels must be collected repeatedly as models improve, because annotations gathered on an older generation of responses may not capture the distinctions that matter at the cutting edge of performance or the detail required for highly nuanced evaluations.
This creates a data maintenance challenge: human annotations can become stale and may fail to accurately reflect the relative quality of new, more capable models. The research question is whether a high-performing evaluator model can be bootstrapped without continuous reliance on new human-labeled examples. The paper aims to demonstrate that synthetic data, generated fully by LLMs themselves, can effectively train a high-quality evaluator. The self-improvement loop involves generating synthetic preference pairs and filtering the model’s own judgments of them, so the process is autonomous and does not rely on human annotators.
Related Work
The paper situates itself in a growing body of literature on LLM-based evaluators. Recent trends show that LLMs can serve as “judges,” providing detailed reasoning chains before delivering a final evaluation decision. These evaluators can be applied to open-ended tasks where the notion of correctness is subjective or not easily captured by classic metrics like BLEU or ROUGE. However, such LLM-as-a-Judge solutions have shown variability and room for improvement.
Moreover, using synthetic data for training models is not new. Synthetic data approaches have been explored in many NLP scenarios, including training language models to solve math problems, simulating new conditions, and adapting to new tasks. Synthetic preference data generation, while less explored, has begun to show promise in certain benchmarks and for specialized tasks like factuality or safety evaluation. The novelty here is a fully iterative, end-to-end approach that improves a general-purpose evaluator using entirely synthetic preferences, with no human intervention after the initial model and instruction pool are provided.
Methodology
The proposed method targets pairwise evaluation: given a user instruction (which might be a question, command, or multi-turn dialogue), and two candidate responses from models (A and B), the evaluator should choose the better response. This pairwise setup is widely used because it simplifies the evaluation decision to a binary choice. The LLM-as-a-Judge approach prompts the evaluator model with the instruction, the two responses, and a rubric. The model first generates a chain-of-thought reasoning process, then outputs a verdict of which response is better.
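To make this pairwise setup concrete, below is a minimal Python sketch of how an LLM-as-a-Judge prompt could be assembled and its verdict parsed. The template wording, the `generate` callable, and the `judge_pair` helper are illustrative assumptions for this summary, not the paper’s actual prompt or code.

```python
# Minimal sketch of a pairwise LLM-as-a-Judge call. The rubric wording and the
# `generate` helper (any text-in, text-out LLM call) are assumptions.

JUDGE_TEMPLATE = """You are evaluating two responses to the same user instruction.
First reason step by step about which response better satisfies the instruction
(helpfulness, correctness, completeness), then give a final verdict.

[Instruction]
{instruction}

[Response A]
{response_a}

[Response B]
{response_b}

End your answer with exactly one line: "Verdict: A" or "Verdict: B"."""


def judge_pair(generate, instruction: str, response_a: str, response_b: str):
    """Ask the evaluator model for a reasoned verdict on a response pair."""
    prompt = JUDGE_TEMPLATE.format(
        instruction=instruction, response_a=response_a, response_b=response_b
    )
    judgment = generate(prompt)  # chain-of-thought reasoning plus final verdict
    text = judgment.strip()
    verdict = "A" if text.endswith("Verdict: A") else (
        "B" if text.endswith("Verdict: B") else None
    )
    return judgment, verdict  # reasoning trace and parsed winner (or None)
```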
The core innovation is how the training data is generated and refined:
- Initialization: The starting point is a pool of unlabeled user instructions and a strong instruction-following LLM (the seed model). This large set of instructions can be diverse and unfiltered, containing queries of varying complexity, topics, and difficulty.
- Instruction Selection: Given a massive uncurated instruction dataset, the authors first select a subset of instructions that are both challenging and cover relevant skill domains. They use an LLM to classify each instruction into categories (e.g., coding, reasoning, brainstorming) and assign complexity and expected response length. This classification helps create a balanced, focused training set. For example, instructions involving complex reasoning or specialized domains can help produce evaluators that perform better on harder benchmarks.
- Response Pair Construction: This step is critical to generating synthetic preferences. For each selected instruction, the model creates a pair of responses (A and B), one intended to be better than the other. The authors devise a clever trick: the LLM is prompted to write a slightly modified version of the original instruction and then to produce a high-quality answer to that modified instruction. Because that answer addresses a related but different query, it is plausible yet off-target for the original instruction. By doing so, they form a synthetic preference pair: the first response (the baseline answer to the original instruction) is preferable to the second (the answer written for the modified instruction). No human annotation is needed; the pair is labeled by construction.
- Judgment Annotation: With these synthetic preference pairs available, the next challenge is to produce the training signal for the evaluator model. The current model (LLM-as-a-Judge) generates reasoning chains and verdicts for each pair, and because one response in each pair is better by construction, the correct label is known. The authors sample multiple judgments from the evaluator and keep only the verdicts that match the known preference; if no correct verdict is produced after multiple sampling attempts, the example is discarded. This filtering ensures the automatically generated judgments are correct, and the surviving judgments, complete with reasoning traces, serve as high-quality training data for the next iteration. The key idea is that although the model may initially be imperfect at judging, some fraction of the sampled judgments is correct; over time, as the model trains on more correct reasoning samples, it internalizes these patterns and improves.
- Iterative Training: The process is iterative. Starting from the seed model, one obtains synthetic training data (preference pairs and correct reasoning judgments), fine-tunes the model on this dataset, and thereby produces a new and improved LLM-as-a-Judge. In the next iteration, this improved evaluator re-annotates the dataset and can add more refined examples, leading to a virtuous cycle of self-improvement: as the evaluator’s quality improves, it generates correct judgments for more examples, gradually expanding the training set and further enhancing the model’s evaluative capabilities. A minimal sketch of one such iteration appears after this list.
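The following sketch, under the same assumptions as the earlier snippet, illustrates one iteration of this loop: building a labeled pair via instruction modification and keeping only the sampled judgments that agree with the label known from construction. It reuses the hypothetical `judge_pair` helper; the prompts and the `fine_tune` placeholder are illustrative, not the authors’ exact recipe.

```python
import random

def build_synthetic_pair(generate, instruction: str):
    """Create a labeled preference pair without human annotation: a direct
    answer to the instruction (preferred) versus a good answer to a perturbed
    instruction (rejected), which is plausible but off-target."""
    modified = generate(
        "Rewrite this instruction so it asks for something related but "
        "noticeably different:\n" + instruction
    )
    chosen = generate("Answer the following instruction well:\n" + instruction)
    rejected = generate("Answer the following instruction well:\n" + modified)
    return chosen, rejected


def collect_training_examples(generate, instructions, n_samples=8):
    """Rejection-sampling filter: keep only judgments whose verdict matches
    the label known from construction."""
    examples = []
    for instruction in instructions:
        chosen, rejected = build_synthetic_pair(generate, instruction)
        # Randomize the order so the evaluator cannot rely on position alone.
        if random.random() < 0.5:
            a, b, gold = chosen, rejected, "A"
        else:
            a, b, gold = rejected, chosen, "B"
        for _ in range(n_samples):
            judgment, verdict = judge_pair(generate, instruction, a, b)
            if verdict == gold:
                examples.append({"instruction": instruction, "response_a": a,
                                 "response_b": b, "judgment": judgment})
                break  # one correct reasoning trace per pair suffices here
        # Pairs with no correct verdict after n_samples attempts are dropped.
    return examples


# One training iteration (schematic): fine-tune on the filtered judgments,
# then regenerate data with the improved model. `fine_tune` is a placeholder.
# model = fine_tune(model, collect_training_examples(model.generate, pool))
```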
![Self-Taught Evaluators Summary](https://kingy.ai/wp-content/uploads/2024/12/download-2024-12-12T215159.418.jpeg)
Experiments
The paper conducts a series of experiments using a strong base model, Llama-3-70B-Instruct, as the seed model. The authors assess the trained evaluators on several benchmarks:
- RewardBench: A leader-board evaluation framework where models are tested on their ability to judge response pairs across various categories (e.g., chat difficulty, safety, reasoning).
- MT-Bench: Another well-known benchmark that tests an evaluator’s agreement with human judgments. It includes a variety of challenging and diverse queries.
- HelpSteer2: A dataset of human-labeled preferences focusing on helpfulness and other dimensions of answer quality, used to measure how well the evaluator aligns with human-scored data.
The authors compare their Self-Taught Evaluator against:
- The original seed model’s evaluation capabilities.
- Evaluators trained on large amounts of human-labeled preference data (e.g., from HelpSteer2).
- Established strong evaluators like GPT-4.
- Reward models that are trained using classifier-based scoring instead of generative chain-of-thought judging.
Key Results
- RewardBench: Starting from a base accuracy of about 75.4 on RewardBench (the seed model), the Self-Taught Evaluator improves significantly. After several iterations of self-training, it reaches 88.3 accuracy, and with majority-voting inference (sampling multiple verdicts and choosing the majority), it achieves 88.7. This level of performance matches or surpasses top-performing reward models trained with human-labeled data. For comparison, using 10k human-labeled preferences from HelpSteer2 to train a Llama-based evaluator yields 85.6 accuracy, meaning the synthetic approach outperforms even substantial human supervision. The improvements are most noticeable in more difficult evaluation categories such as Chat Hard, Safety, and Reasoning. This indicates that self-generated data is especially beneficial for challenging evaluation scenarios.
- Comparison to GPT-4 on MT-Bench: The Self-Taught Evaluator also performs competitively against GPT-4 when judged on MT-Bench. On non-tie examples, it achieves a human agreement score comparable to that of GPT-4. This suggests that an entirely self-trained evaluator can approach the performance of current state-of-the-art evaluators that rely on human annotations for training.
- HelpSteer2 Validation: The iterative training with synthetic data improves both raw accuracy and position-consistent accuracy on the HelpSteer2 validation set. Position-consistent accuracy measures whether the model is stable in its judgments when the order of the two candidate responses is reversed. The Self-Taught Evaluator reduces positional bias and increases reliability, another sign that it has genuinely learned consistent evaluation criteria rather than exploiting superficial positional cues. A sketch of majority voting and the position-consistency check follows this list.
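The two inference-time checks mentioned above, majority voting over sampled verdicts and position consistency under response reordering, could be implemented roughly as follows. This reuses the hypothetical `judge_pair` helper from the methodology sketch; the paper’s exact voting and evaluation protocols may differ.

```python
from collections import Counter

def majority_vote(generate, instruction, resp_a, resp_b, n_samples=5):
    """Sample several verdicts and keep the most common parseable one."""
    verdicts = [judge_pair(generate, instruction, resp_a, resp_b)[1]
                for _ in range(n_samples)]
    verdicts = [v for v in verdicts if v is not None]
    return Counter(verdicts).most_common(1)[0][0] if verdicts else None


def position_consistent(generate, instruction, resp_a, resp_b):
    """True if the same underlying response wins in both orderings."""
    original = majority_vote(generate, instruction, resp_a, resp_b)
    swapped = majority_vote(generate, instruction, resp_b, resp_a)
    # If resp_a wins, the label is "A" in the original order and "B" after the
    # swap, so consistency means the two labels differ.
    return original is not None and swapped is not None and original != swapped
```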
Ablations and Analyses
The paper includes various ablations to understand the importance of different design choices and to explore the boundaries of the approach:
- Data Sources for Synthetic Preferences: The authors try synthetic preferences generated from instructions focusing on different skills, including safety, math (GSM8K), coding, and reasoning. They find that all these domains improve the evaluator’s performance over the baseline. Data from “reasoning” prompts yields particularly strong improvements in the Reasoning category of RewardBench. This suggests that domain-focused synthetic data can be used to specialize or enhance certain evaluation skills.
- Alternative Methods for Generating Bad Responses: The main method generates a modified instruction and then a “good” answer to that modified query, which functions as a “bad” answer for the original instruction. Another simpler approach is to directly prompt the LLM to produce a worse answer. Although this alternative still improves evaluation capabilities (to around 80.7 on RewardBench), it is not as effective as the original method (which achieved 83.8 before iteration). Thus, the clever construction of pairs via instruction modification is key to creating more instructive data; both prompting strategies are sketched after this list.
- Comparison to Human-Labeled Preferences: Using human-labeled preferences from HelpSteer2 as the initial source of training data leads to improvements, but iterative training on synthetic data alone surpasses that performance. Interestingly, iterative training starting from labeled data and generating synthetic data afterward also yields strong performance. Yet, purely synthetic iterative training is already strong enough, offering a cost-effective and scalable solution.
- Combining Synthetic and Human Data: The authors experiment with mixing synthetic and human-labeled data in different ratios. Most configurations yield strong results, and certain mixes slightly improve the final scores. This indicates that if some human-labeled data is available, it can be combined with synthetic preferences. This combination can potentially reach even higher performance levels.
- Instruction Complexity and Selection Effects: The paper shows that the curated instructions—those categorized as complex and related to reasoning—help create challenging data that leads to better evaluators. Distributions of complexity, expected response length, and categories reveal that focusing on more complex instructions fosters a stronger evaluator, since the model must learn to discern subtle differences in quality that only arise in more demanding queries.
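For illustration only, the two ways of obtaining the “worse” response compared in this ablation could look like the sketch below; the prompt wording is assumed rather than quoted from the paper.

```python
def bad_response_via_modified_instruction(generate, instruction: str) -> str:
    """Main recipe: a good answer to a perturbed instruction, which reads as a
    plausible but off-target answer to the original instruction."""
    modified = generate("Rewrite this instruction so it asks for something "
                        "related but noticeably different:\n" + instruction)
    return generate("Answer the following instruction well:\n" + modified)


def bad_response_via_direct_prompt(generate, instruction: str) -> str:
    """Simpler ablation baseline: ask outright for a lower-quality answer to
    the original instruction."""
    return generate("Give a noticeably lower-quality answer to this "
                    "instruction:\n" + instruction)
```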
![AI self evaluation](https://kingy.ai/wp-content/uploads/2024/12/download-2024-12-12T221313.718-1024x572.jpeg)
Discussion and Implications
The Self-Taught Evaluator paradigm suggests a new frontier in LLM training and evaluation methodology. The approach lowers the cost by reducing or eliminating the need for human-generated preference data. It also reduces logistical barriers to developing high-quality evaluators. As models improve, the approach also naturally adapts: a stronger model can generate even richer synthetic data and more accurate self-judgments, iteratively boosting the evaluator without any new human intervention.
Another key advantage is the generality of this approach. It can be applied to any domain where the LLM can generate reasonably coherent answers. Unlike human-labeled data, which may be limited to certain domains and may need to be recollected as the model improves, synthetic data can be produced on the fly. An evaluator trained in this manner can be used to judge other models’ outputs, support automated benchmarking, and serve as a reward model for RLHF or related alignment strategies.
There are limitations, of course. For one, the approach relies on the seed model already being fairly capable. If the seed model is poor at understanding instructions or reasoning, generating meaningful pairs of responses that differ in quality becomes challenging. The method also focuses on pairwise evaluation rather than single-response scoring. Additionally, the iterative process might be computationally expensive since it involves repeated cycles of data generation and filtering. Finally, the method’s success depends on the model’s capacity to reason correctly at least some of the time, so that correct judgments can be extracted and reinforced.
![](https://kingy.ai/wp-content/uploads/2024/12/GemAcxYb0AArjhz.png)
Conclusion
The paper introduces the Self-Taught Evaluator, a method to train powerful LLM-as-a-Judge models without relying on costly human-labeled preference datasets. By carefully constructing synthetic preference pairs and iteratively training on the model’s own filtered judgments, the approach achieves remarkable performance, equaling or surpassing top-tier evaluators that rely on large amounts of human annotation. These results open the door to scalable, autonomous improvement of LLM evaluation models and demonstrate that self-improvement loops can reduce dependency on human labelers while accelerating the pace of LLM development.
This represents an important step toward more sustainable and adaptable LLM evaluation pipelines. Future work might explore the minimal competence required of the seed model, methods to ensure stable convergence, and extensions to single-response grading rather than pairwise comparisons. Nonetheless, the success of the Self-Taught Evaluator demonstrates the power of iterative self-improvement via synthetic data and advances the state of the art in LLM-based evaluation.