Introduction and Motivation
The paper investigates the reasoning capabilities of a recently released large language model (LLM) developed by OpenAI, referred to as the “o1” or “Orion-1” model. Since its unveiling in September 2024, o1 has garnered attention for its purportedly superior logical reasoning abilities, especially in solving challenging mathematical problems. Previous large language models, like GPT-4 and Claude 3.5, have also shown impressive reasoning skills, but their performance sometimes relies heavily on patterns recognized from their pre-training data, raising the question of whether their success reflects genuine reasoning or memorized solutions.
The o1 model is claimed by OpenAI to have a robust logical reasoning capability. OpenAI even suggests that o1-mini, a smaller variant of the o1 model, can achieve performance comparable to top high school contestants in the American Invitational Mathematics Examination (AIME). However, some early investigations have cast doubt on these claims. In private tests with high-school level math problems not typically found online, the o1 models have at times underperformed. This raises the question: Is o1’s reasoning genuinely robust? Or is it merely an artifact of encountering familiar training examples?
To address this question, the authors propose a form of “A/B” testing using two sets of mathematically challenging problems: one that is widely known and likely to appear in model training data, and another that is more obscure and less likely to have been seen by the model. By comparing o1-mini’s performance on these two sets, the authors assess whether the model’s good performance on public benchmarks is due to reasoning or memorization. Specifically, if o1’s ability to handle well-known problems is substantially greater than its ability on equally difficult but less publicly accessible problems, this suggests that memorization plays a large role. Conversely, if o1’s performance is similar on both sets, this indicates that the model’s reasoning ability is genuine and generalizes beyond memorized material.
Background and Previous Work
Evaluating LLMs on mathematical reasoning tasks is an active area of research. Early studies focused on elementary and high-school level mathematics, with benchmarks like GSM8K (Cobbe et al., 2021), the MATH dataset (Hendrycks et al., 2021), and AIME-style questions. More recent efforts have progressed to more advanced, Olympiad-level mathematics, where problems demand creative and rigorous reasoning rather than straightforward pattern recognition.
A number of papers have tested models like GPT-4 on International Mathematical Olympiad (IMO) problems and related tasks. GPT-4 and earlier LLMs have shown remarkable progress. However, their success rate on Olympiad problems remains limited. Some correct answers come from pattern-matching or partial memorization rather than reasoning from first principles.
The o1 model, which uses token-wise reinforcement learning to encourage a chain-of-thought reasoning style, appears to be a step forward in cultivating genuine reasoning capabilities. Some research indicates that o1 surpasses previous models in reasoning about code, logic puzzles, and mathematical proofs. At the same time, critics wonder whether o1’s improved performance is still heavily influenced by memorization. This paper contributes to the debate by setting up an experiment that tests whether o1’s strong performance on well-known competition-level problems derives from exposure during training or from genuine reasoning skill.
Methodology
- Dataset Construction. The authors create two datasets of challenging high-school level math problems, chosen to approximate the level of complexity and rigor seen in the IMO. The first dataset consists of 60 problems selected from recent International Mathematical Olympiads (IMO). The IMO is the highest-level high-school math competition, and its problems are well-known, thoroughly documented, and readily available online. Given their fame and the open availability of past IMO problems, it is highly plausible that these problems (or very similar ones) appeared in o1’s training corpus. The second dataset consists of 60 problems from the Chinese National Team (CNT) training camp. The CNT training is a rigorous, closed environment for selecting China’s IMO representatives. Each year, trainees solve many challenging problems that are similar in difficulty and style to IMO problems, but these problems are not as widely circulated and are thus far less likely to appear in publicly accessible training data. The authors argue that CNT problems are about as challenging as IMO problems, but less accessible and thus less likely to have been “memorized” by the model. By comparing performance on the highly accessible set (IMO) and the more private set (CNT), the authors test whether public availability affects performance. If o1-mini’s reasoning stems primarily from memorization of known solutions, then performance on IMO problems should exceed that on CNT problems. Conversely, if reasoning ability is robust and general, o1-mini should perform roughly equivalently on both sets.
- Problem Types and Grading Criteria. The problems are classified into three main categories: “Proof,” “Search,” and “Solve.”
Since formal proofs are challenging to produce automatically, the authors relax the grading. A standard IMO grading allocates up to 7 points per problem: 1 point for the final answer in numerical or short-answer tasks, 2 points for plausible main ideas, and up to 4 points for a rigorous, fully justified proof. The authors simplify their assessment, focusing on whether o1-mini can produce the correct final answer and a plausible reasoning chain, even if not fully rigorous. Crucially, the authors note that o1-mini often relies on heuristic trial-and-error and “guessing” techniques, such as testing small cases for search problems. While this approach may yield correct solutions in some instances, it typically lacks the rigorous justification expected in Olympiad solutions. Nonetheless, for the sake of analysis, correctness of the final answer carries significant weight in the evaluation.
- Testing Procedure. The authors feed the problems to o1-mini without special prompting. The problems are provided in LaTeX form, which the model can parse and interpret. The model’s output is then evaluated by human graders familiar with such competitions. By examining whether the model produces a correct final answer on each problem, the authors compute an “accuracy ratio” for both the IMO and CNT sets, focusing especially on Search and Solve type problems, because these categories often have a definite final answer (e.g., a specific function or a set of integers), making accuracy easier to assess (a minimal sketch of this accuracy computation follows this list).
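As a concrete illustration of the evaluation described above, here is a minimal sketch of the accuracy-ratio computation, assuming human graders have already marked each final answer as correct or incorrect. The record layout, field names, and sample entries are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical grading records produced by human graders after reading
# o1-mini's output: (problem source, problem category, final answer correct?).
Grade = Tuple[str, str, bool]

grades: List[Grade] = [
    ("IMO", "Search", True),
    ("IMO", "Solve", False),
    ("CNT", "Search", True),
    ("CNT", "Solve", True),
    # ... one record per problem (60 IMO + 60 CNT in the paper's setup)
]

def accuracy_ratio(grades: List[Grade], source: str, category: str) -> float:
    """Fraction of problems from `source` in `category` with a correct final answer."""
    relevant = [ok for src, cat, ok in grades if src == source and cat == category]
    return sum(relevant) / len(relevant) if relevant else float("nan")

for source in ("IMO", "CNT"):
    for category in ("Search", "Solve"):
        print(source, category, f"{accuracy_ratio(grades, source, category):.0%}")
```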
Results and Statistical Analysis
- Overall Performance Across Datasets. After evaluating all 120 problems (60 from IMO and 60 from CNT), the authors find that the performance of o1-mini does not differ dramatically between the two sets. Specifically:
- For “Search”-type problems, the model’s accuracy is roughly 70% on both IMO (16 out of 23) and CNT (19 out of 27); a difference that small is not statistically significant (see the sketch after this list).
- For “Solve”-type problems, the model’s accuracy is also similar: about 21% for IMO and 22% for CNT.
- Rejecting the Memorization Hypothesis. The key hypothesis being tested is that if o1-mini relies heavily on memorization (i.e., has “seen” the IMO problems during training), then it should perform substantially better on the IMO set than on the CNT set. Since no such discrepancy was found, the authors argue that the hypothesis of pure memorization can be rejected. The consistent performance suggests that o1-mini’s abilities rest on reasoning patterns that generalize across problems rather than on regurgitating known solutions. Even though o1’s performance is not perfect, the lack of a gap between known and less-known problems implies that the model is not merely replaying memorized solutions for common competition problems.
- Comparison to Benchmark Models. The authors note that GPT-4, widely considered one of the strongest commercially available LLMs before o1, achieved about 40% accuracy on high-level Olympiad benchmarks in prior studies. The results for o1-mini in this study are in a similar range. The fact that o1-mini performs at roughly a 48–51% level for combined Solve and Search tasks across both the IMO and CNT datasets, and that this performance does not degrade on the private dataset, is seen as evidence of generalizable reasoning skill.
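The Search-type counts reported above (16 of 23 on IMO vs. 19 of 27 on CNT) can be checked with a standard test for a difference between two proportions. The paper does not state which test the authors used, so Fisher’s exact test below is an illustrative choice.

```python
from scipy.stats import fisher_exact

# 2x2 contingency table for Search-type problems: [correct, incorrect] per dataset.
imo_correct, imo_total = 16, 23
cnt_correct, cnt_total = 19, 27
table = [
    [imo_correct, imo_total - imo_correct],   # IMO: 16 correct, 7 incorrect
    [cnt_correct, cnt_total - cnt_correct],   # CNT: 19 correct, 8 incorrect
]

odds_ratio, p_value = fisher_exact(table)
print(f"IMO Search accuracy: {imo_correct / imo_total:.1%}")   # ~69.6%
print(f"CNT Search accuracy: {cnt_correct / cnt_total:.1%}")   # ~70.4%
print(f"Fisher's exact p-value: {p_value:.2f}")                # well above 0.05
```

A p-value near 1 for this table is consistent with the authors’ conclusion that the two accuracies are statistically indistinguishable.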
Qualitative Observations and Case Studies
Beyond the main statistical tests, the authors present several case studies that illustrate how o1-mini solves problems and where it falls short, providing insight into the nature of o1-mini’s reasoning and its typical patterns, strengths, and weaknesses.
- Positive Example: Providing Intuition. In one example, o1-mini is given a problem that involves placing stones on a grid while satisfying certain combinatorial constraints. The model reasons about coloring arguments and symmetric properties, identifying a key insight that solves the problem. While it lacks a complete rigorous proof, the model’s intuition aligns well with a standard human solution approach. This shows that o1 can mimic the kind of reasoning that reduces a complex configuration problem to a known combinatorial argument, suggesting a semblance of higher-level thought.
- Guessing and Partial Justifications. For many “Search” type problems where the solution set might be integers or specific functions, o1-mini adopts a trial-and-error approach: it tests small values and attempts to detect patterns. For instance, given a Diophantine equation, the model tries a few small integers, notices a pattern, and then makes a guess about the general form of the solution (a small-case sketch of this heuristic follows this list). It might confirm that prime powers work by checking small primes and drawing a plausible conclusion. While this is far from a rigorous proof, the model often succeeds in identifying correct solutions; the shortcoming is that it rarely provides the all-important justification that no other solutions are possible. This tendency to guess suggests that o1-mini can explore solution spaces heuristically, which may come off as problem-solving, but from a human perspective it looks more like a heuristic search than a proof. Still, the method yields correct answers frequently enough to raise its accuracy rate.
- Comparison to Human Reasoning. In some case studies, the authors compare a plausible human solution path to the approach o1-mini takes. They note that a human solution to a complicated Olympiad problem would systematically consider subcases, prove that no extraneous solutions exist, and give thorough justifications. In contrast, o1-mini often provides a skeletal reasoning line without rigorous details; it might try a series of computations and inferences without fully connecting the dots. The difference lies in rigor: the model might identify the “big idea” or generate a key insight but fail to anchor it with precise arguments.
- Weakness in “Proof” Problems. As expected, o1-mini struggles more with proof-oriented problems. While it can sometimes outline the major ideas of a proof, it rarely furnishes the kind of precise, step-by-step logical argument that a math competition requires. The model might know the final result or state something resembling the structure of a known proof, but it fails to argue rigorously. The authors highlight that formal mathematical proofs place a high premium on rigor, something LLMs currently find challenging.
- Difficulties with Complex Search Strategies. The authors present a problem involving a grid and hidden monsters, where a snail (Turbo) tries to navigate from the top row to the bottom row. The optimal solution requires a clever strategy ensuring the snail can find a “safe” path in a minimal number of attempts. Humans devise a plan that exploits symmetries and elimination of possibilities to guarantee a solution in at most 3 attempts. O1-mini, on the other hand, suggests a much larger upper bound on attempts (e.g., 2023 attempts), failing to find the intricate combinatorial insight that compresses the search space dramatically. This example shows that while o1 can guess or pattern-match in algebra or number theory tasks, it struggles with strategic reasoning that requires a global, optimized approach. The complexity of spatial and combinatorial problem-solving still poses a challenge for LLMs.
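To make the small-case “guess and conjecture” pattern from the Search case study concrete, here is a minimal sketch of that heuristic on a made-up Pell-type equation; the equation is purely illustrative and is not one of the paper’s problems.

```python
# Enumerate small cases of an illustrative Diophantine equation, x^2 - 2y^2 = 1,
# the way o1-mini is reported to probe small values before guessing a general form.
solutions = [
    (x, y)
    for x in range(1, 200)
    for y in range(1, 200)
    if x * x - 2 * y * y == 1
]
print(solutions)  # [(3, 2), (17, 12), (99, 70)] -- suggests a Pell-equation pattern

# Spotting this pattern is the heuristic step; it does NOT prove that no other
# solutions exist, which is exactly the rigor gap the authors highlight.
```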
Discussion and Implications
The main contribution of the paper is the demonstration that o1’s performance does not degrade significantly when moving from widely accessible problems (IMO) to less accessible ones (CNT). The authors interpret this result as evidence that o1’s high performance is not primarily due to memorization. Rather, it suggests that o1 does possess some form of reasoning that generalizes.
However, the authors are careful to note that “reasoning” as exhibited by o1 still falls short of what a human expert would consider rigorous mathematical reasoning. The model’s reasoning is more akin to heuristic guesswork guided by patterns learned during training. O1 often fails to provide complete justifications or to handle the most complex combinatorial arguments elegantly. Nonetheless, this type of reasoning is still a step forward from mere memorization of solutions.
The paper’s results contribute to the broader evaluation of LLMs in mathematical reasoning. The authors join an emerging literature that emphasizes the importance of testing models on private datasets and on problem sets with different levels of accessibility. By doing so, researchers can distinguish between mere memorization and genuine reasoning capabilities.
Also, the authors highlight that while o1’s performance might be consistent across different datasets, it has room to improve in terms of rigor and versatility. The findings call for continuing refinement of training methods, possibly involving reinforcement learning over proofs and formal reasoning steps. Future models will need to bridge the gap between intuitive pattern recognition and formal rigor to become more reliable proof assistants and problem solvers.
Limitations and Future Work
The authors acknowledge several limitations. First, their evaluation metric is somewhat lenient: rather than fully penalizing the model for lacking rigorous proof steps, it focuses on correctness of final answers or partial reasoning steps. A more stringent grading, matching the standards of formal Olympiad grading, would likely show lower success rates.
Second, the datasets, while carefully chosen, are still limited in size (60 problems from each of IMO and CNT). More extensive testing with a broader variety of private problem sets could provide stronger evidence.
Third, the authors have not performed a detailed token-level or chain-of-thought analysis to see how exactly the model arrives at solutions. Such an analysis might reveal whether certain solution sketches are memorized templates or genuinely novel reasoning sequences.
Future research could incorporate automated theorem provers, comparing o1’s informal reasoning with formal proof search methods. Another avenue is to combine o1’s intuitive leaps with a formal verification step, thus enhancing rigor. Additionally, investigating how training methods, data filtering, and explicit curriculum design might improve the model’s ability to produce rigorous arguments would be valuable.
Conclusion
This study presents a careful A/B test of whether the OpenAI o1 model’s strong performance on known, publicly accessible Olympiad problems is due to memorizing training examples. By comparing the performance of o1-mini on a well-known dataset (IMO) and a comparable but less accessible dataset (CNT), the authors find no significant performance difference. This result suggests that the model’s reasoning ability is not just a product of encountering the same problems during training.
Instead, o1 seems capable of generalizing its problem-solving approach to novel, unseen problems of similar difficulty. The model’s reasoning ability, however, remains imperfect. O1 often lacks rigorous justification steps and relies on heuristic reasoning patterns. Still, the lack of a performance gap between public and private datasets indicates that the model’s success is not solely due to memorization.
The paper contributes an important piece of evidence supporting the claim that LLMs like o1 exhibit some genuine reasoning capabilities: they are not limited to regurgitating training data but can extend their reasoning to newly encountered problems of similar complexity. While the journey toward fully rigorous, human-like mathematical reasoning continues, the results here show promising progress in that direction.