In the arena of contemporary language model research, s1: Simple Test-Time Scaling meticulously chronicles an ambitious and innovative project: the construction of a specialized dataset, s1K, the fine-tuning of a 32B-parameter model (s1-32B), and the systematic exploration of how to scale that model’s test-time compute. It stands at the intersection of advanced model design, data curation, and theoretical inquiry, aspiring to ensure that techniques once shrouded behind closed doors become fully open and replicable.
Modern large language models (LLMs) exhibit remarkable aptitude across myriad tasks, from arithmetic and geometry to biology and quantum theory. Yet many frontiers remain poorly understood, particularly test-time scaling. The idea is straightforward, but its impact is profound: at inference, one can allocate more computational steps (or “thinking tokens”) to potentially boost a model’s final accuracy. But how does one control those steps effectively? How large can they get before performance saturates? And does the same approach keep driving performance upward on exceptionally difficult problems, or does it buckle when context windows max out?
This new paper attempts to provide clarifying insights. It describes a twofold achievement: first, the creation of a minimal data corpus for reasoning tasks, spanning exactly 1,000 meticulously curated samples (the “s1K” dataset). Second, the authors exhibit a technique called “budget forcing,” which compels the model at inference time to generate more or fewer intermediate steps, thus scaling the model’s compute usage. They test the technique across recognized benchmarks: AIME24 for high-school competition mathematics, MATH500 for a broader range of competition math problems, and GPQA Diamond for graduate-level biology, chemistry, and physics questions.

By revealing substantial accuracy boosts and stable control over the number of “thinking tokens,” the paper clarifies that test-time scaling is not merely an esoteric curiosity. Rather, it is a vital means of extracting advanced reasoning from a model, pushing it to systematically refine or re-think solutions. While many prior works have explored “Chain-of-Thought” prompting (e.g., Wei et al., 2023) or tree-based search strategies (Wu et al., 2024b), the essence of this new approach is simpler: forcibly hold the model’s reasoning open, instruct it to wait, or break off its chain of thought prematurely, depending on the user’s preference. These manipulations often produce remarkable outcomes, not solely in raising ultimate problem-solving capacity, but also in enabling more predictable usage of computational resources.
Genesis of s1K: Quality, Difficulty, Diversity
Fundamental to the entire enterprise is a 1,000-sample dataset, named s1K. Hunting for an eclectic mix that covers mathematics (geometry, real analysis, number theory, partial differential equations, etc.) and scientific topics (biology, quantum theory, electromagnetic fields, computer science, and more), the authors set out three guiding data principles:
- High Quality: They intentionally exclude samples found to contain formatting issues, ASCII diagrams that garble text, or incomplete references. The data needed to be clearly comprehensible and incorporate correct solutions for each question.
- High Difficulty: They used model-based filtering: each candidate question was posed to Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct, and if either model solved it correctly, the question was deemed “too easy” and removed (see the sketch after this list). This ensured only the perplexing or advanced problems remained.
- Diversity: To avoid overfitting on a narrow domain, they took questions from 50 distinct categories ranging from combinatorics to thermodynamics, guaranteeing breadth.
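A minimal sketch of that model-based difficulty filter, assuming hypothetical `generate_solution` and `is_correct` helpers in place of real model inference and answer grading (neither comes from the paper’s released code):

```python
# Minimal sketch of the model-based difficulty filter described above.
# `generate_solution` and `is_correct` stand in for real model inference and
# answer grading; they are hypothetical placeholders, not the paper's code.
from typing import Callable, Iterable

CHECKER_MODELS = ["Qwen2.5-7B-Instruct", "Qwen2.5-32B-Instruct"]

def filter_too_easy(
    samples: Iterable[dict],
    generate_solution: Callable[[str, str], str],  # (model_name, question) -> model's answer
    is_correct: Callable[[str, str], bool],        # (model's answer, reference answer) -> bool
) -> list[dict]:
    """Keep only questions that neither checker model answers correctly."""
    hard_samples = []
    for sample in samples:
        solved_by_any = any(
            is_correct(generate_solution(model, sample["question"]), sample["answer"])
            for model in CHECKER_MODELS
        )
        if not solved_by_any:  # both models failed, so the question is retained as "hard"
            hard_samples.append(sample)
    return hard_samples
```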
They began with an expansive 59K-item pool of candidates from numerous sources. One prominent reference is MATH (Hendrycks et al., 2021), an influential competition math dataset. Another is AGIEval (Zhong et al., 2023), featuring exam questions from the SAT, LSAT, and more. Additional sets included OlympiadBench (He et al., 2024), OmniMath (Gao et al., 2024a), SciEval (Sun et al., 2024), USACO computational problems (Shi et al., 2024), and others. By filtering for quality and difficulty and then sampling for domain diversity, the authors compressed the 59K raw samples into s1K’s final 1,000.
Interestingly, the authors also mention long reasoning traces as an implicit measure of difficulty: if a query triggered a chain of 5,000+ tokens of intermediate reasoning in Qwen2.5-32B-Instruct, it was presumably more challenging. Indeed, among the s1K samples are multi-step geometry proofs, advanced linear algebra conundrums, integrable random variable derivations, quantum wavefunction analyses, and more. The authors share a short excerpt about conditional expectations of random variables in advanced probability—a tiny glimpse into s1K’s complexity.
The Pillars of Test-Time Scaling
The central theme is test-time scaling, or how to systematically enlarge the model’s iterative reasoning at inference to amplify accuracy. One can do this in a “sequential” manner, letting the LLM refine its internal chain of thought step by step, or “parallel,” generating many solutions simultaneously and then aggregating them (e.g., by majority vote).
The paper classifies extant or novel methods to scale test-time compute:
- Sequential Methods
- Budget Forcing: The authors’ contribution. This directly manipulates the upper or lower bound on “thinking tokens” by forcibly inserting either end-of-thinking tokens or extra statements like “Wait.” If the LLM tries to end early, they re-append “Wait,” forcing it to reflect further. Conversely, if they want fewer steps, they insert the end-of-thinking delimiter quickly.
- Step-Conditional Control: The prompt demands a certain number of high-level reasoning steps, e.g., “Provide a 5-step solution.” The model tries to produce precisely that many steps (illustrative prompts for this and the next two controls appear after this list).
- Token-Conditional Control: The user specifies “Use no more than 1,000 tokens to think,” or some similar directive.
- Class-Conditional Control: The user indicates “use short thinking” or “use long thinking.”
- Parallel Methods
- Majority Voting: Sample multiple solutions in parallel, then choose the most frequent final answer.
- Best-of-N Selection: Sample multiple solutions, then pick the final answer that a reward model ranks highest.
- Tree Search Variants: Such as REBASE (Wu et al., 2024b) or Monte Carlo Tree Search. Often a process-level reward model is used to prune or refine partial solutions.
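For the conditional-control variants above, the prompt is the whole mechanism. The templates below are illustrative assumptions, not the exact wording used in the paper’s experiments:

```python
# Illustrative prompt templates for the three conditional controls; the exact
# wording used in the paper's experiments may differ.
STEP_CONDITIONAL = (
    "Solve the problem below. Provide exactly {n_steps} high-level reasoning steps, "
    "then state the final answer.\n\n{question}"
)
TOKEN_CONDITIONAL = (
    "Solve the problem below. Use no more than {max_tokens} tokens of thinking "
    "before giving the final answer.\n\n{question}"
)
CLASS_CONDITIONAL = "Solve the problem below. Use {length_class} thinking.\n\n{question}"  # "short" or "long"

prompt = TOKEN_CONDITIONAL.format(max_tokens=1000, question="How many primes are less than 100?")
```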
Within the sequential domain, budget forcing is singled out as strikingly simple but surprisingly potent. It does not require an extensive reward model, nor does it demand a new specialized search tree structure. Instead, it harnesses the fact that these LLMs either know how to reflect longer if asked or can be coerced to produce an answer quickly. The authors highlight a scenario where the model starts to converge on an incorrect solution, but forcibly reminding it to “Wait” triggers re-evaluation that leads to an eventual correct or more accurate final answer.
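A minimal sketch of the budget-forcing control logic, assuming a hypothetical `generate(prompt, stop, max_tokens)` helper and a placeholder end-of-thinking delimiter; word counts stand in for real tokenizer counts, and this illustrates the idea rather than reproducing the paper’s implementation:

```python
# Minimal sketch of budget forcing at decode time. `generate(prompt, stop, max_tokens)`
# is a hypothetical helper returning text up to (but excluding) the stop string;
# word counts are used as a crude stand-in for tokenizer token counts.
END_OF_THINKING = "<|end_of_thinking|>"  # placeholder delimiter, not the model's actual token

def budget_forced_reasoning(prompt, generate, min_waits=2, max_think_tokens=8000):
    """Lower bound: append "Wait" whenever the model tries to stop thinking early.
    Upper bound: insert the end-of-thinking delimiter once the budget is exhausted."""
    trace, waits_used = "", 0
    while True:
        remaining = max_think_tokens - len(trace.split())
        chunk = generate(prompt + trace, stop=END_OF_THINKING, max_tokens=remaining)
        trace += chunk
        if len(trace.split()) >= max_think_tokens:   # maximum bound reached: cut thinking short
            trace += END_OF_THINKING
            break
        if waits_used < min_waits:                    # minimum bound: suppress the stop and
            trace += " Wait"                          # nudge the model to keep reasoning
            waits_used += 1
        else:
            trace += END_OF_THINKING
            break
    # Condition the final answer on the full (forced) reasoning trace.
    return generate(prompt + trace + "\nFinal Answer:", stop=None, max_tokens=512)
```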

Evaluative Framework
To measure the utility and reliability of these scaling methods, the paper proposes three metrics (a formalization is sketched after this list):
- Control: Emphasizes whether the method truly adheres to the user’s request regarding compute usage. For instance, if the user demands 4,000 tokens, do they get 4,000±some small deviation, or does the model ignore it altogether?
- Scaling: Captures how effectively performance rises as more compute is permitted. Is the slope steep? Does doubling tokens yield a consistent jump in accuracy?
- Performance: Details the absolute maximum accuracy that can be squeezed from the method.
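One plausible formalization, consistent with the descriptions above but not copied from the paper (the notation is an assumption): let A be the set of average thinking-token counts observed across runs of a method, f(a) the accuracy achieved at compute level a, and [a_min, a_max] the budget range the user requested.

```latex
% Hedged formalization of the three metrics; notation is illustrative.
\[
\text{Control} = \frac{1}{|\mathcal{A}|}\,\Bigl|\bigl\{\, a \in \mathcal{A} \;\bigm|\; a_{\min} \le a \le a_{\max} \,\bigr\}\Bigr|
\]
\[
\text{Scaling} = \binom{|\mathcal{A}|}{2}^{-1} \sum_{\substack{a,\,b \,\in\, \mathcal{A} \\ b \,>\, a}} \frac{f(b) - f(a)}{b - a},
\qquad
\text{Performance} = \max_{a \in \mathcal{A}} f(a)
\]
```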
They show piecewise linear plots, with the x-axis signifying the average number of tokens spent “thinking” and the y-axis signifying accuracy. The measure of success is whether a method can systematically raise that curve. The authors stress that the context window remains a potential bottleneck: if the model is forced to produce 20,000 tokens of reasoning, it can approach or exceed the 32k or 64k token limits of typical architectures and get truncated. Similarly, ignoring the end-of-thinking token too many times can lead the LLM into repetitive loops or an indefinite stalling scenario.
Results: The Emergence of s1-32B
The authors begin by describing their fine-tuning process. They took Qwen2.5-32B-Instruct (Qwen et al., 2024), a high-performing 32B-parameter model, and performed supervised fine-tuning (SFT) on the s1K dataset. Because s1K is relatively small, training converged in just 26 minutes on 16 NVIDIA H100 GPUs. They did not rely on elaborate hyperparameter tuning (an equivalent training configuration is sketched after this list). Instead, they used:
- A learning rate of 1e−5.
- Five epochs in total (so about 315 gradient steps).
- Bfloat16 precision.
- AdamW optimizer (Loshchilov & Hutter, 2019) with 0.9 / 0.95 betas.
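A sketch of an equivalent supervised fine-tuning configuration using Hugging Face `TrainingArguments`; the learning rate, epochs, precision, and Adam betas mirror the list above, while the output path, batch size, and learning-rate schedule are assumptions not stated in this summary, and dataset/model loading is omitted.

```python
# Sketch of an equivalent SFT configuration via Hugging Face TrainingArguments.
# Learning rate, epochs, precision, and Adam betas mirror the list above; the
# batch size and scheduler are assumptions, and dataset/model loading is omitted.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="s1-32b-sft",            # hypothetical output path
    learning_rate=1e-5,
    num_train_epochs=5,                 # ~315 gradient steps over 1,000 samples
    per_device_train_batch_size=1,      # assumption: 16 GPUs x 1 sample -> ~63 steps/epoch
    bf16=True,                          # bfloat16 precision
    optim="adamw_torch",                # AdamW (Loshchilov & Hutter, 2019)
    adam_beta1=0.9,
    adam_beta2=0.95,
    lr_scheduler_type="cosine",         # assumption: schedule not specified in this summary
    logging_steps=10,
)
```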
They call their newly minted model s1-32B. The authors immediately highlight that, despite training on a mere 1,000 examples, s1-32B climbs into a performance regime rivaling or surpassing other open reasoning models trained on far more data. They compare it to QwQ-32B (Team, 2024b), r1-distill (DeepSeek-AI et al., 2025), and other open solutions. On tough benchmarks like AIME24 (the American Invitational Mathematics Examination from early 2024) or GPQA Diamond (Rein et al., 2023), s1-32B’s accuracy is quite competitive:
- AIME24: s1-32B hits around 50.0%–57.0% accuracy depending on the budget forcing method used (the baseline Qwen2.5-32B-Instruct remains at ~26.7%).
- MATH500: s1-32B edges around 93.0%.
- GPQA Diamond: s1-32B hovers near 59.6%.
That last dataset is especially challenging because it requires advanced graduate-level knowledge of biology, chemistry, and physics. So near-60% correctness is notable, given that experts with relevant PhDs score around 69.7% on the same subset (OpenAI, 2024). Another standout competitor is r1-distill, which is trained on 800k reasoning samples. Even with that massive dataset, r1-distill reaches around 72.6% on AIME24 and 62.1% on GPQA Diamond: notably better on AIME but only slightly higher on GPQA relative to s1-32B. The crucial difference, though, is that s1-32B used merely 1,000 examples, far fewer than r1-distill’s 800k. Indeed, the authors label s1-32B a “sample-efficient reasoning model.”
Budget Forcing in Action
One central experiment is a demonstration of budget forcing. The authors systematically run s1-32B on AIME24 while forcing minimum and maximum token budgets. For example:
- Minimum Bound (“Wait”)
When the model tries to produce an end-of-thinking token prematurely, the authors append “Wait” and suppress the end-of-thinking delimiter, prompting the model to keep reasoning. Repeatedly applying this technique can double or triple the chain-of-thought length. They label these runs “2x,” “4x,” or “6x” according to how often the end-of-thinking token is ignored. The results show that s1-32B’s AIME24 accuracy rises from 50% to ~57% (about a 7-point gain) before plateauing.
- Maximum Bound
They forcibly insert the end-of-thinking token if the chain extends beyond a certain threshold (e.g., 2,000 tokens). This ensures the model can’t go on forever. The control metric is near 100% because it is straightforward to cut reasoning short.
The scaling metric—the slope of the piecewise line from low to high test-time compute—indicates that performance climbs as allowed reasoning tokens increase, up to a certain leveling-off. Interestingly, for simpler tasks like MATH500, gains appear smaller (perhaps from 91% to 93%, since the baseline is already fairly good). For the more challenging GPQA Diamond, s1-32B sees a moderate boost from ~56.6% to nearly 59.6%.
Yet, an upper limit emerges: if the context window is 32k tokens, once the model has spent 28k tokens on the chain-of-thought, only 4k remain for the final answer. If that final answer is incomplete, performance can degrade. The authors highlight that repeated ignoring of the end-of-thinking token can lead the LLM into cycles. They show an example: an erroneous reflection that loops infinitely until forcibly truncated.
Comparison to Other Scaling Methods
The paper also examines alternative ways to harness more test-time compute:
- Parallel Approaches
A baseline is Majority Voting. For the base model, Qwen2.5-32B-Instruct, generating 2, 4, 8, …, 64 parallel samples at a temperature of 1 and then choosing the most frequent final answer raises AIME24 performance from 26.7% up to ~40%–45%. That is meaningful but not as high as the 50%–57% from s1-32B’s sequential approach. Another parallel method, REBASE (Wu et al., 2024b), uses a reward model to prune partial solutions; because it scores candidates at each step, it can also achieve strong performance, but it requires additional reward-model forward passes, incurring overhead. A minimal sketch of the majority-voting baseline (combined with the rejection filter described below) follows this list.
- Token-/Step-Conditional Control
The authors find that if they instruct the model, “Please think in exactly 5 steps,” the LLM tries to produce 5 steps but often lumps multiple sub-steps into each enumerated line, or occasionally tries to circumvent the token limit. They observe a phenomenon of “compute hacking”: told to use fewer steps, the LLM stuffs many tokens into each step; told to use many steps, it puts minimal text in each. The net outcome is subpar controllability.
- Rejection Sampling
This approach repeatedly samples from the model at a given temperature, discarding all solutions that exceed a length threshold. The authors show an “inverse scaling” phenomenon: when the threshold is short, the model discards a huge fraction of samples (an average of 655 tries per example in one setting), which is computationally massive, and performance sometimes declines.
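As referenced above, here is a combined sketch of the majority-voting baseline with an optional length-based rejection filter; `generate` and `extract_final_answer` are hypothetical placeholders for real sampling and answer parsing, not the paper’s code.

```python
# Combined sketch of majority voting with an optional length-based rejection
# filter. `generate` and `extract_final_answer` are hypothetical placeholders
# for real sampling and answer parsing; they are not the paper's code.
from collections import Counter
from typing import Callable, Optional

def majority_vote(
    question: str,
    generate: Callable[[str, float], str],       # (prompt, temperature) -> full completion
    extract_final_answer: Callable[[str], str],  # completion -> canonical final answer
    n_samples: int = 64,
    temperature: float = 1.0,
    max_think_tokens: Optional[int] = None,      # if set, acts as the rejection threshold
) -> str:
    answers = []
    for _ in range(n_samples):
        completion = generate(question, temperature)
        if max_think_tokens is not None and len(completion.split()) > max_think_tokens:
            continue                              # rejection sampling: drop over-budget traces
        answers.append(extract_final_answer(completion))
    # Most frequent final answer wins; returns "" if every sample was rejected.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```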
In comparing them all, the paper argues that budget forcing is the most direct method: simple to implement, with decent controllability and good scaling. It is not necessarily the single best approach for absolute maximum accuracy (tree search or best-of-N might surpass it), but it has fewer external components and remains highly convenient for tasks that do not demand a separate reward model.
Further Investigations and Future Directions
While the authors celebrate test-time scaling’s potential, they identify limitations:
- Flattening Gains: Even budget forcing eventually stops helping. Past ~57% accuracy on AIME24, “Wait” or extended reflection yields diminishing returns. The model performs iterative expansions but does not necessarily discover new, deeper insights. More advanced meta-reasoning or specialized search might be needed for further leaps.
- Context Window Restrictions: If the context is 32k, the model can’t indefinitely reflect. Solutions for infinite context—like streaming generation or external scratchpad memory—might be necessary for truly unbounded test-time scaling.
- Risk of Loops or Repetitions: Repeatedly ignoring the end-of-thinking token can cause the model to produce unproductive cycles. Some method to detect self-repetition, or to enforce novelty in each chunk, might help; a naive check of that kind is sketched below.
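By way of illustration only (the paper does not propose a specific detector), a naive overlap-based check for such cycles might look like this:

```python
# Illustrative only: a crude overlap-based check for unproductive reasoning
# cycles. The paper does not propose this; it is an assumption about what
# "detecting self-repetition" could look like in practice.
def is_looping(trace: str, window: int = 50, threshold: float = 0.8) -> bool:
    """Flag a trace whose last `window` words heavily overlap (as a set) with
    the `window` words immediately before them."""
    words = trace.split()
    if len(words) < 2 * window:
        return False
    recent = set(words[-window:])
    previous = set(words[-2 * window:-window])
    return len(recent & previous) / len(recent) >= threshold
```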
Another angle the authors discuss is whether reinforcement learning approaches to training might encourage better test-time scaling. For instance, they ask if an RL approach that tightly couples the model’s reward to the correctness of extended solutions might produce more robust budget forcing. Or whether advanced search frameworks, such as a systematically guided MCTS with a process-level reward model, could push “extrapolation” further at test time.
Finally, the authors strongly recommend standardizing the metrics of control, scaling slope, and maximum performance when describing test-time scaling. By quantifying these factors, future research can compare different solutions more directly.
s1K Data Ablations
The paper includes a dedicated ablation that compares the curated s1K with smaller or differently assembled datasets. For instance, they try a “1K-random” dataset, merely plucking 1,000 random questions from the original 59K pool, and retrain. They also try a “1K-diverse” dataset that maximizes domain coverage but does not filter out easier problems. Both yield weaker results than s1K, especially on AIME24, sometimes dropping from 56.7% accuracy down to as low as 26.7%. The authors stress that combining difficulty with careful domain variety is crucial for robust performance.
Broader Comparisons and the Road Ahead
This paper engages with multiple parallel efforts:
- OpenAI’s o1 Family (OpenAI, 2024). A closed-source approach that popularized “test-time scaling” for advanced reasoning, achieving top-tier results on MATH500 and GPQA.
- DeepSeek-r1 (DeepSeek-AI et al., 2025; https://arxiv.org/abs/2501.12948). Another large-scale open model trained with reinforcement learning, significantly surpassing many prior baselines.
- QwQ-32B (Team, 2024b). An open 32B model claiming robust reasoning skills, albeit with undisclosed or vague dataset details.
- Sky-T1 (Team, 2025). A $450-budget approach to distilling a 32B reasoning model from a teacher.
- Bespoke-Stratos (Labs, 2025). Distills from r1, focusing heavily on reasoning processes.
Across these endeavors, s1-32B is singled out as “sample efficient,” meaning it needed a fraction of the training data (just 1,000 curated examples) to achieve performance near or beyond older big-data models. The authors connect this to the “Superficial Alignment Hypothesis” (Zhou et al., 2023), which suggests that large pre-trained LLMs already contain latent reasoning capabilities, requiring only a small set of carefully chosen triggers to unlock them. s1K, they argue, is exactly such a carefully curated set.
The significance resonates more widely than the immediate benchmarks: for future expansions in advanced math, scientific reasoning, or specialized tasks (like theorem proving or code debugging), s1K’s success might motivate specialized micro-datasets. Tuning on thousands or tens of thousands of random examples might be less beneficial than picking 1,000 “brutally tough and instructive” ones.
Scientifically Open Collaboration
An important subtext of the paper is a push toward open science. The authors release the entire codebase, dataset, and instructions, referencing the training environment on Stanford’s Marlowe GPU cluster (Kapfer et al., 2025). They caution that reproducibility remains delicate with long reasoning contexts, especially when using GPU-accelerated inference frameworks like vLLM (Kwon et al., 2023), which can introduce slight numeric drift that leads to drastically different final tokens. Yet the authors believe that, by documenting these pitfalls, they create a baseline for others to replicate or improve upon. The references to HPC infrastructure, the weaving together of varied data sources, and the honesty about erroneous loops during extremely long chains of thought all contribute to a sense of thorough, transparent analysis.
Conclusion: Taming Complexity with Simple Tools
This paper offers a vantage point from which to see how small but precise interventions can yield large leaps in performance. Operators can tweak the model’s “thinking budget,” and the model, if well-trained to parse such instructions, responds with deeper or shallower reasoning. This, the authors argue, fosters an environment where the “optimal” reasoning length might be discovered on the fly. Even on intricate tasks like geometry proofs or quantum scattering analyses, s1-32B can “just keep going” if commanded, occasionally unearthing new lines of logic that unlock the correct solution.
But they also underscore real challenges. For instance, forcibly extending the chain-of-thought can degrade performance once the context limit is near. If one aims to push well beyond 32k or 100k tokens, a more structural approach—maybe an iterative external scratchpad or a memory-based agent—might be needed. Also, if the user can’t or won’t trust a model to “eventually self-correct,” they might prefer a parallel or tree-based approach, collecting many attempts and using a reward model to filter them.
Ultimately, the study’s contribution weaves together:
- A new dataset (s1K) to teach reasoning in a broad, advanced domain with minimal examples.
- A new model (s1-32B) that exemplifies how just 1,000 well-chosen samples can replicate much of the success of bigger reasoning-based models.
- A thoughtful taxonomy of test-time scaling methods—parallel vs. sequential—and a championing of budget forcing as the simplest route to controlling how many steps a model takes.
This synergy of fine-grained data curation, short but potent training, and robust inference manipulations stands as a powerful statement. It invites anyone fascinated by large language models, mathematical or scientific AI breakthroughs, or new directions in open-source collaboration to explore further.
References to Linkable Sources Cited
- DeepSeek-AI et al. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. https://arxiv.org/abs/2501.12948.
- Brown, B., Juravsky, J., Ehrlich, R., Clark, R., Le, Q. V., Ré, C., & Mirhoseini, A. (2024). Large language monkeys: Scaling inference compute with repeated sampling. https://arxiv.org/abs/2407.21787.
- Gao, L., Tow, J., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., McDonell, K., Muennighoff, N., Phang, J., Reynolds, L., Tang, E., Thite, A., Wang, B., Wang, K., & Zou, A. (2021). A framework for few-shot language model evaluation. https://doi.org/10.5281/zenodo.5371628.
- Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., & Steinhardt, J. (2021). Measuring mathematical problem solving with the math dataset. https://arxiv.org/abs/2103.03874.
- Zhong, H., Xiao, C., Tu, C., Zhang, T., Liu, Z., & Sun, M. (2023). Agieval: A human-centric benchmark for evaluating foundation models. https://arxiv.org/abs/2304.06364.
- Wu, Y., Sun, Z., Li, S., Welleck, S., & Yang, Y. (2024b). Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. https://arxiv.org/abs/2408.00724.
- Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., & Stoica, I. (2023). Efficient memory management for large language model serving with pagedattention. https://arxiv.org/abs/2309.06180.
- Kapfer, C., Stine, K., Narasimhan, B., Mentzel, C., & Candes, E. (January 2025). Marlowe: Stanford’s gpu-based computational instrument. https://doi.org/10.5281/zenodo.14751899.
- Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. https://arxiv.org/abs/1711.05101.
- OpenAI. (September 2024). Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/.
- Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., & Bowman, S. R. (2023). Gpqa: A graduate-level google-proof q&a benchmark. https://arxiv.org/abs/2311.12022.
- Google. (December 2024). Gemini 2.0 flash thinking mode (gemini-2.0-flash-thinking-exp-1219). https://cloud.google.com/vertex-ai/generative-ai/docs/thinking-mode.
- Team, D. (November 2024a). Deepseek r1. https://x.com/deepseek_ai/status/1859200141355536422.
- Labs, B. (2025). Bespoke-stratos: The unreasonable effectiveness of reasoning distillation. https://hf.co/bespokelabs/Bespoke-Stratos-32B.