Within the rapidly evolving terrain of artificial intelligence research, the notion of cultivating Large Language Models (LLMs) to self-improve and self-correct has ignited considerable fascination. The paper titled “Teaching Language Models to Critique via Reinforcement Learning” undertakes a comprehensive study of the intricacies behind training LLMs to generate informed, actionable critiques for code generation tasks. Through an extensive methodology that unifies execution-guided supervision, multi-stage optimization, and reinforcement learning, the authors propose a new framework called CTRL (Critic Training via Reinforcement Learning). By centering on the domain of code generation—which naturally presents objective correctness evaluations via test cases—the authors demonstrate that LLM-based critics, once trained to produce both discriminative and instructive feedback, can dramatically enhance the capabilities of even stronger generator models. They further illustrate that these critic models can generalize beyond their own scale, thereby enabling systems that exhibit an emergent “weak-to-strong” supervision phenomenon.
This summary will traverse the paper’s central motivations, architectural underpinnings, experimental procedures, empirical findings, and broader significance, incorporating links to relevant work throughout the discussion. The objective is to provide a robust conceptual overview that conveys the authors’ main contributions in a thorough manner.
Motivations and Problem Context
The paper’s impetus arises from the observation that current Large Language Models, although remarkably capable in generating code and other textual artifacts, frequently stumble in producing consistent, logically precise, or executable solutions without oversight or iterative guidance. The authors highlight how, despite LLMs’ formidable generative prowess, the domain of critiquing remains underdeveloped. Iterative refinement, in which a model presents a draft solution and then re-evaluates it—either by itself or with an external critic—appears a promising avenue for performance improvement. However, if the “feedback bottleneck” remains ill-defined, such self-improvement processes can lead to performance stagnation or even degradation. Indeed, as indicated by Huang et al., 2023, purely self-reflective loops may spiral into confusion, repeating the same mistakes ad infinitum.
Critiquing involves two core capacities: (1) discrimination, the ability to ascertain whether a solution is correct or incorrect; and (2) the capacity to offer cogent, step-by-step feedback that can correct mistakes in a subsequent revision. The authors underline the tension between merely detecting errors and guiding a fix. In certain code-generation contexts, reward models compress complex criteria into simple numeric signals, while a “verification” approach might simply yield raw traces (e.g., error logs) unsuited to direct high-level improvements. By contrast, a specialized critic can, in principle, produce potent guidance that fosters more dramatic enhancements with each iteration.
The paper situates itself at the intersection of self-improvement approaches—like Reflexion (Shinn et al., 2024) and Self-Refine (Madaan et al., 2024)—and discrimination frameworks reliant on reward models. The authors then propose that a decoupled critic, thoroughly trained to deliver direct textual feedback to a generator model, can produce feedback that bridges the gulf between raw correctness checks (e.g., “your code has an error in test case 3”) and truly helpful commentary (e.g., “you misused the heap data structure, and should switch to a max-heap approach to maintain the k-th smallest value efficiently”). By connecting this approach to the realm of code generation, the paper capitalizes on the clarity of automated test-based correctness signals, thus offering a self-contained environment to evaluate iterative improvements.

Relevance of Code Generation as a Test Bed
Code generation is an appealing test domain because it provides a direct, unambiguous measure of correctness: namely, does the proposed program pass the given test suite? The paper complements prior work in code generation technologies (Li et al., 2022, Sun et al., 2024), underscoring that while LLMs can produce partially correct or even workable solutions, it remains beneficial to systematically refine them across multiple critique-revision cycles.
Through these cycles, certain subtle improvements often arise. For instance, an LLM might first propose a naive or incomplete solution, then a critic identifies structural pitfalls or inefficiencies such as using a min-heap incorrectly, or conflating string indexing logic in code. If the critic can articulate precisely why the solution was incorrect—using natural language that references the domain constraints, data structures, or code syntax—the next iteration of the generator stands a better chance of producing a fully correct, or at least more advanced, version of the solution. This resonates with software engineering best practices, wherein code review processes often revolve around precisely diagnosing and rectifying mistakes.
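To make this pass/fail signal concrete, the snippet below is a minimal sketch of how a candidate program could be judged against a test suite in a sandboxed run; the stdin/stdout contest format, the `passes_test_suite` helper, and the timeout value are illustrative assumptions rather than the paper's actual harness.

```python
# Minimal sketch of a pass/fail check for generated code, assuming a simple
# stdin/stdout contest format (not the paper's actual sandbox implementation).
import subprocess
import sys

def passes_test_suite(program_source: str,
                      test_cases: list[tuple[str, str]],
                      timeout_s: float = 2.0) -> bool:
    """Return True only if the program matches the expected output on every test."""
    for stdin_text, expected_stdout in test_cases:
        try:
            result = subprocess.run(
                [sys.executable, "-c", program_source],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,  # guards against infinite loops
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected_stdout.strip():
            return False
    return True

if __name__ == "__main__":
    candidate = "print(sum(map(int, input().split())))"
    tests = [("1 2", "3"), ("10 20", "30")]
    print(passes_test_suite(candidate, tests))  # True
```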
Key Observations Leading to CTRL
The authors’ impetus for creating CTRL emerges from four cardinal insights:
- Limited Self-Critique Performance: Sole reliance on the same LLM that generated the code to critique itself yields limited improvements, especially in complex scenarios. Their experiments indicate a self-critique approach that references raw execution logs can help, but only to a point.
- Decoupled Critic Benefits: When the critic is not the same model as the solution generator, it can more reliably identify issues and provide suggestions. This is reminiscent of collaborative workflows in which distinct agents assume specialized roles.
- Reinforcement Learning Over Critique Space: Simple supervised finetuning from static sets of high-quality feedback does not fully capture the vastness of potential critique strategies. The authors argue that critiques occupy a large, high-variance space; hence, direct optimization of the textual feedback to maximize correction success is key.
- Weak-to-Strong Generalization: Fascinatingly, many of the authors’ experiments point to a scenario in which a comparatively weaker critic—finetuned meticulously—can guide a stronger, more capable generator. This generalization stands in contrast to naive intuition that a weaker critic might hamper a stronger generator, emphasizing the synergy achievable under carefully structured training.
Theoretical Underpinnings and Markov Chain Model
In building up to their method, the authors use a Markov chain formalism to illustrate how iterative refinement can be seen as repeated transitions from one solution state to another across multiple attempts:
- The probability that an incorrect solution transitions to a correct one, denoted $p_{cw}$, captures how improvements happen when errors exist.
- The probability that a correct solution remains correct, denoted $p_{cc}$, represents the critic’s ability to avoid introducing new mistakes, i.e., to prevent what the authors dub “error compounding.”
By simulating these transitions under various assumptions (strong vs. weak critiquing, strong vs. weak discrimination), they highlight that robust critiquing significantly raises success rates over repeated attempts. In effect, a well-tuned critic can systematically shift an LLM’s iterative attempts closer to correctness, not just by randomly sampling solutions but by systematically learning from repeated failures in a guided manner.
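To see why these two quantities matter, the following small simulation plays out the Markov-chain view with illustrative values for $p_{cw}$ and $p_{cc}$ (the numbers are made up for the example, not taken from the paper):

```python
# Toy simulation of iterative refinement as a two-state Markov chain.
# p_cw: probability an incorrect solution becomes correct after one critique-revision step.
# p_cc: probability an already-correct solution stays correct (1 - p_cc is regression).

def success_rate_after(turns: int, p_cw: float, p_cc: float, p0: float = 0.0) -> float:
    """Probability of holding a correct solution after `turns` critique-revision rounds."""
    p_correct = p0
    for _ in range(turns):
        p_correct = p_correct * p_cc + (1.0 - p_correct) * p_cw
    return p_correct

for label, p_cw, p_cc in [("strong critic", 0.40, 0.95), ("weak critic", 0.15, 0.80)]:
    trajectory = [round(success_rate_after(t, p_cw, p_cc), 3) for t in (1, 2, 4, 8)]
    print(f"{label}: {trajectory}")
```

Even under these rough assumptions, modest gains in both probabilities compound quickly across rounds, which captures the intuition behind investing in a stronger critic.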
The CTRL Framework: An Overview
The paper’s central contribution is CTRL (Critic Training via Reinforcement Learning), an end-to-end pipeline with two primary training stages:
- Stage I – Execution-Guided Critique Synthesis
In this phase, the authors create high-quality, diverse textual critiques using execution feedback from a “sandbox” environment. Whenever a code snippet fails or partially fails tests, they automatically gather relevant error logs, partial stack traces, or failing test case details. These are then fed into a “hint” mechanism, guiding the generation of critiques that localize the error in plain language.
By excluding direct references to raw logs in the final critique, the authors push the LLM to internalize the error reasoning without copying ephemeral details. The product is a curated set of problem-solution-critique triplets gleaned from real execution outcomes, forming the dataset for supervised finetuning (SFT). Although SFT alone yields notable improvements in discrimination (the ability to label solutions as correct or incorrect), the authors find it insufficient for consistently high-level critique generation.
- Stage II – Reinforced Critique Generation
Following SFT, the model is refined further using a variant of Group Relative Policy Optimization (Shao et al., 2024) to directly optimize how critiques lead to better revised solutions. Within this second stage, the authors circumvent the challenges of building stable value networks (as in PPO) by grouping multiple critiques per problem-solution instance and computing advantage signals that measure whether the solution passes tests after revision. By normalizing these signals across critiques, they reduce the variance that typically hobbles straightforward policy gradient approaches.
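As a rough illustration of the group-relative idea, the fragment below normalizes per-critique rewards within a group sampled for one problem-solution pair. It is a sketch of the general GRPO-style advantage computation under simplified assumptions, not the authors' training code, and the reward values are hypothetical.

```python
# Sketch of group-relative advantages: several critiques are sampled for the
# same problem-solution pair, each scored by whether the revised code passes
# the tests, and rewards are normalized within the group.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize per-critique rewards within one problem-solution group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four critiques for one buggy solution; only the second critique
# led to a revision that passed the test suite.
print(group_relative_advantages([0.0, 1.0, 0.0, 0.0]))
# The passing critique gets a positive advantage and the others negative ones,
# so the policy gradient pushes probability mass toward the useful critique.
```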

Discrimination vs. Critiquing
The learned critic does more than just label solutions as correct or not. It must also produce textual feedback that spurs a subsequent improvement. In the authors’ words, the space of helpful critiques is vast, and only a subset meaningfully fosters better solutions. Hence, the paper frames the problem as:

$$\max_\theta \; \mathbb{E}_{\,z \sim \mathcal{D} \times \pi,\; c \sim \pi_\theta(\cdot \mid z),\; y \sim \pi(\cdot \mid z, c)}\big[R(y)\big]$$

where $\theta$ parameterizes the critic, $z$ is the problem-solution pair, $c$ is the textual feedback, $\pi$ is the generator, and $R(y)$ is a binary reward (pass/fail) on the revised solution $y$. The critic’s job is to produce $c$, which modifies the generator’s distribution for the subsequent revision. In multiple-turn expansions, the process can be repeated, with each iteration culminating in either a passing snippet or a still-incorrect one that may be rectified in subsequent tries.
Linking back to the earlier Markov model, a better critic aims to maximize $p_{cw}$ and $p_{cc}$, i.e., to swiftly convert incorrect solutions into correct ones while avoiding the regression of correct solutions into broken states. The authors underscore that random textual commentary might yield minimal improvements, so the RL-based approach is crucial in systematically discovering high-leverage feedback patterns.
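Read operationally, the objective corresponds to a simple rollout: the critic writes a critique, the generator revises, and the sandbox returns a binary reward. The sketch below spells this out with placeholder callables (`critic_generate`, `generator_revise`, and the test checker) standing in for the actual LLM and sandbox calls, which are not specified in this summary.

```python
# One critique-revision rollout producing the binary reward R(y).
# The callables are hypothetical stand-ins, not the paper's interfaces.

def critique_revision_reward(problem, solution, tests,
                             critic_generate, generator_revise, run_tests):
    """One turn of the loop: z = (problem, solution) -> critique c -> revised y -> R(y)."""
    critique = critic_generate(problem, solution)             # c ~ pi_theta(. | z)
    revised = generator_revise(problem, solution, critique)   # y ~ pi(. | z, c)
    reward = 1.0 if run_tests(revised, tests) else 0.0        # binary pass/fail R(y)
    return critique, revised, reward

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    toy_critic = lambda p, s: "The loop bound is off by one; iterate up to n inclusive."
    toy_generator = lambda p, s, c: "print(sum(range(int(input()) + 1)))"
    toy_checker = lambda src, tests: True  # a real checker would execute the code
    print(critique_revision_reward("sum 1..n", "print(0)", [],
                                   toy_critic, toy_generator, toy_checker))
```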
Experimental Setting and Benchmarks
To validate CTRL’s effectiveness, the authors employ a range of code-generation benchmarks:
- CodeContests (Li et al., 2022): A challenging dataset of competitive programming tasks.
- LiveCodeBench (Jain et al., 2024): A curated set of recent coding problems with minimal data contamination.
- MBPP+ (Liu et al., 2024a): An extended version of the MBPP benchmark of entry-level programming problems, augmented with more rigorous test cases.
- JudgeBench (Tan et al., 2024): A broader domain benchmark for generative reward models that queries whether the critic can accurately discriminate solution quality across various tasks.
For the generator, they mostly use Qwen2.5-Coder (or a similarly sized model), while for the critic they experiment with different fine-tuned variants: a baseline self-critique approach, GPT-4-based critics, and their newly proposed CTRL critic. By surveying these different critic-generator pairings, they examine the synergy among model scales, illustrating that even a weaker critic can steer a more adept generator, consistent with prior “weak-to-strong” oversight insights (Christiano et al., 2018).
Major Findings
- Substantial Improvements in Pass@1: On CodeContests, pairing the generator with a CTRL critic raises single-turn Pass@1 accuracy significantly relative to a naïve zero-shot baseline. When the critique-revision loop is extended over multiple iterations, improvements accumulate, in some cases reaching a relative gain of 106.1% for the base model.
- Mitigating Error Compounding: The authors track how frequently correct solutions become incorrect after a revision (i.e., the regression rate), discovering that CTRL drastically reduces such error propagation relative to other baseline critics.
- Generalization to Stronger Generators: Remarkably, the paper shows that the CTRL critic, trained with a Qwen2.5-Coder generator, can successfully guide GPT-4o, a much stronger model. This reaffirms the possibility that a properly refined critic transcends the capacity of its own generative baseline, a phenomenon the authors deem “weak-to-strong generalization.”
- Enhanced Discrimination: Empirical results in JudgeBench highlight that CTRL’s textual feedback mechanism can serve effectively as a generative reward model. It classifies correct vs. incorrect solutions with near state-of-the-art accuracy, and it can also produce justifications that indicate how it arrived at that classification.
- Trade-Off of Execution Time: On certain datasets like LiveCodeBench, solutions guided by CTRL critiques occasionally time out since the revised solutions may undergo more meticulous or comprehensive logic, leading to longer average runtimes. Even so, the net correct solutions (Pass@1 rate) remain higher, a testament to the improved reliability.
Ablations and Analysis
The paper presents several ablation studies. One notable experiment compares self-critique performance to that of execution feedback directly plugged into the generator. Self-critique using raw execution feedback does yield mild improvements, but remains inferior to the carefully guided critiques that rely on the LLM’s capacity to interpret, reorganize, and propose precise modifications. Furthermore, a side-by-side comparison with standard Proximal Policy Optimization (PPO) reveals that naive value-network-based credit assignment is unstable for critique generation, thereby justifying the authors’ reliance on Group Relative Policy Optimization. This approach mitigates variance by evaluating multiple critiques at once, standardizing their immediate success rates, and only then applying policy gradient updates.
Additionally, the authors measure the similarity between the original code snippet and the revised snippet to ascertain whether the critic fosters minor local changes or encourages deeper transformations. The CTRL-based critic fosters significantly lower similarity scores than self-critique methods, suggesting that it is more likely to propose bold, structural modifications. These deeper changes often yield bigger performance leaps for tasks requiring complex logic.
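One plausible way to quantify how far a revision departs from the original snippet is a character-level diff ratio, sketched below; the authors' exact similarity metric may differ, so this is only meant to illustrate the kind of measurement involved.

```python
# Similarity between an original and a revised snippet: lower values suggest
# deeper structural rewrites rather than minor local edits.
from difflib import SequenceMatcher

def code_similarity(original: str, revised: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, original, revised).ratio()

original = "def f(xs):\n    return sorted(xs)[0]\n"
local_fix = original.replace("[0]", "[-1]")
deep_rewrite = (
    "import heapq\n\n"
    "def f(xs):\n"
    "    heap = list(xs)\n"
    "    heapq.heapify(heap)\n"
    "    return heapq.heappop(heap)\n"
)
print(code_similarity(original, local_fix))     # high: small local edit
print(code_similarity(original, deep_rewrite))  # low: structural rewrite
```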
Iterative Refinement for Code and Beyond
Though the paper anchors itself in code generation, the authors briefly note that the unified textual feedback approach could be generalized to other tasks that have well-defined correctness checks. For instance, if dealing with mathematical proofs or step-by-step question-answering in advanced reading comprehension, the system would rely on a “verification environment” of some kind. The same principles of producing discrete, actionable feedback in a textual form may well apply. As soon as a domain has a robust evaluation method—an automated metric or oracle—the approach of pairing a specialized critic with a generator, and reinforcing the critic’s feedback to optimize final outcomes, becomes feasible.
This resonates with the broader theme of “scalable oversight” in AI safety, where less capable systems are trained to successfully critique or rectify the outputs of more powerful systems. In effect, this could anchor safer or more trustworthy AI pipelines, a vision consistent with older proposals of debate-style frameworks (Irving et al., 2018). The authors, however, remain focused on the immediate, demonstration-laden realm of code generation, presumably because it yields a straightforward vantage point for such iterative improvements.
Limitations
The authors do acknowledge certain caveats and limitations:
- Dependence on Execution Feedback: While code tasks supply a straightforward pass/fail signal, not all domains offer such a conveniently verifiable metric. The system’s training pipeline might be less straightforward to replicate in tasks where partial correctness is tricky to define or test automatically.
- Computational Overheads: Running multiple critiques and repeated solution attempts can be computationally expensive, particularly on large-scale benchmarks. This overhead includes the time for generating critiques, revising solutions, and executing them in the sandbox.
- Sensitivity to Data Quality: The SFT stage relies on curated problem-solution-critique triplets. If these seeds are noisy or incomplete, the RL stage could converge to suboptimal feedback policies.
- Timeout vs. Thoroughness: As evidenced in some empirical results, thoroughness in code solutions can inflate runtime. While it raises overall correctness, it might be at odds with real-time constraints in certain industrial contexts.
Despite these subtleties, the authors argue that the architecture of CTRL remains robust and flexible, offering a systematic manner to incorporate critique into an LLM-based coding assistance pipeline.
Future Directions
Enumerating potentially expansive future directions, the authors propose:
- Extended Multi-Turn Critique: While the paper shows that single-turn training generalizes to multi-turn revisions, there may be further benefits from explicitly training on multi-turn protocols.
- Diverse Domains: Beyond code, expansions into tasks with partial-credit rubrics, intricate solution structures (e.g., multi-step logical deductions), or uncertain feedback channels remain open trajectories.
- Hallucination Mitigation: By more explicitly tying verification steps to external knowledge bases, a sophisticated critic might discourage freeform hallucinations.
- Efficiency and Safety: The intended usage might pivot to optimizing the critic’s suggestions not only for correctness but also for adherence to safety guidelines, data privacy rules, or computational constraints. The authors mention broader possibilities of leveraging advanced RL frameworks that weigh multiple reward signals.
- Scaling Laws for Critique: Just as LLM performance has scaling laws, there might be an analog for critics. Understanding the interplay between critic size, generator size, dataset scale, and pass/fail difficulty is a promising topic for deeper exploration.
Conclusions
In summation, “Teaching Language Models to Critique via Reinforcement Learning” contributes a novel approach for systematically training language models to critique code outputs, bridging the gap between raw pass/fail signals and targeted textual feedback. By formalizing the critique generation process via a two-stage framework—(1) execution-guided synthesis and (2) reinforcement optimization with Group Relative Policy Optimization—the authors position their critic model as a refined, decoupled sub-system that productively cooperates with an arbitrary generator. Results affirm that these learned critics facilitate multi-turn, iterative improvement while mitigating error propagation. Moreover, the synergy they observe between seemingly less powerful critics and more advanced generators highlights the capacity for “weak-to-strong” oversight.
The paper’s synergy between code correctness, structured hints, and dynamic textual feedback stands out. The rigorous experiments explore multiple datasets and model pairings, revealing not only superior performance but also deeper insight into the role of textual critiques in accelerating iterative refinement. Even though the methodology focuses on code tasks, the underlying logic is widely applicable to other domains where solution correctness can be objectively tested, reinforcing the general importance of well-designed critics.
By charting this path, the authors propose a broader paradigm: LLM-based systems need not rely on uniform self-reflection or unidimensional reward signals. Instead, they can incorporate specialized, systematically trained critic sub-models that push them both to correct mistakes and to transform solutions more significantly when the situation demands. Readers interested in the deeper basis and technical details are encouraged to read the full text at https://arxiv.org/abs/2502.03492 or visit the authors’ project page at https://critic-rl.github.io.
Sources
- Teaching Language Models to Critique via Reinforcement Learning. Xie, Z., Chen, J., Chen, L., Mao, W., Xu, J., and Kong, L. arXiv: https://arxiv.org/abs/2502.03492
- Critic-RL Project Page: https://critic-rl.github.io
- Reflexion. Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. arXiv: https://arxiv.org/abs/2303.11366
- Self-Refine. Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al. arXiv: https://arxiv.org/abs/2303.17651
- Huang et al. (2023) on self-correction. Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X., and Zhou, D. “Large language models cannot self-correct reasoning yet.” arXiv: https://arxiv.org/abs/2310.01798
- Li et al. (2022) on CodeContests. Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. “Competition-level code generation with AlphaCode.” Science, 378(6624):1092–1097, 2022.
- Sun et al. (2024) on code generation. Sun, Q., Chen, Z., Xu, F., Cheng, K., Ma, C., Yin, Z., Wang, J., Han, C., Zhu, R., Yuan, S., et al. “A survey of neural code intelligence: Paradigms, advances and beyond.” arXiv: https://arxiv.org/abs/2302.10202
- Christiano et al. (2018) on weak-to-strong oversight. Christiano, P., Shlegeris, B., and Amodei, D. arXiv: https://arxiv.org/abs/1810.08575
- Jain et al. (2024) on LiveCodeBench. Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. “LiveCodeBench: Holistic and contamination free evaluation of large language models for code.” arXiv: https://arxiv.org/abs/2403.07974
- Liu et al. (2024a) on MBPP+. Liu, J., Xia, C. S., Wang, Y., and Zhang, L. “Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation.” arXiv: https://arxiv.org/abs/2305.01210
- Tan et al. (2024) on JudgeBench. Tan, S., Zhuang, S., Montgomery, K., Tang, W. Y., Cuadron, A., Wang, C., Popa, R. A., and Stoica, I. “JudgeBench: A benchmark for evaluating LLM-based judges.” arXiv: https://arxiv.org/abs/2410.12784
- Shao et al. (2024) on Group Relative Policy Optimization (GRPO). Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. arXiv: https://arxiv.org/abs/2402.03300
- Irving et al. (2018) on debate. Irving, G., Christiano, P., and Amodei, D. “AI Safety via Debate.” arXiv: https://arxiv.org/abs/1805.00899
These sources collectively provide deeper context on topics such as self-correction, neural code intelligence, generative reward models, Markov chain approaches, code generation evaluations, and more. By integrating them, the authors of “Teaching Language Models to Critique via Reinforcement Learning” establish a grounded theoretical and empirical framework, situating CTRL at the forefront of critic-based LLM enhancement.