Kingy AI

Gödel Test: Can Large Language Models Solve Easy Conjectures? – Paper Summary

By Curtis Pyke
September 25, 2025

Artificial intelligence has already conquered a long list of benchmarks. From passing the bar exam to outperforming humans on high school math competitions, large language models (LLMs) like GPT-4 and GPT-5 have demonstrated remarkable capabilities. But a deeper, more pressing question remains:

Can AI do more than replicate what we already know? Can it create and prove new mathematics?

A recent paper, Gödel Test: Can Large Language Models Solve Easy Conjectures? (arXiv:2509.18383), tackles exactly this question. Written by Moran Feldman (University of Haifa) and Amin Karbasi (Cisco Foundation AI), it proposes a new benchmark called the Gödel Test—a way of evaluating whether an AI system can produce correct proofs for simple but previously unsolved conjectures.

This article explores the paper in depth, why it matters, what it discovered about GPT-5, and what it means for the future of AI and scientific reasoning.


Why Do We Need a New Benchmark for AI Reasoning?

In recent years, OpenAI, Google DeepMind, and Anthropic have proudly announced that their models are performing at medal-winning levels on the International Mathematical Olympiad (IMO). These are extremely challenging problems, designed to stretch the limits of human ingenuity at the high school level.

And yet, there’s a catch.

IMO problems, while difficult, are known problems. Their solutions exist in the literature. A sufficiently trained model with memorization, pattern recognition, and brute-force reasoning can eventually work through them.

The bigger question is:

  • What happens when you give an AI a problem no one has solved before?
  • Can it move beyond imitation and produce new knowledge?

This is what Feldman and Karbasi aim to measure with the Gödel Test. Instead of feeding AI tricky textbook puzzles, they test whether models can prove brand-new, easy conjectures—problems that haven’t been written down before but should be solvable by a competent graduate student in a day or two.

It’s a shift from “Can AI solve past human challenges?” to “Can AI take a step into the unknown?”


The Gödel Test: What Is It?

Named after the legendary logician Kurt Gödel, whose incompleteness theorems reshaped mathematics, the Gödel Test is not about impossibility results. Instead, it’s a litmus test for originality.

The rules are simple:

  1. Pick a branch of mathematics.
  2. Pose a new conjecture—something not found in existing papers.
  3. Give the AI only the background context and references needed.
  4. See whether it can produce a correct, rigorous proof.

The authors chose submodular maximization, a subfield of combinatorial optimization with wide applications in machine learning (for example, in data summarization, influence maximization, and active learning).
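
To make "submodular maximization" concrete, here is a minimal sketch (not from the paper) of the textbook greedy algorithm on a monotone coverage objective under a cardinality constraint, the kind of set-function problem used in data summarization. The example sets are made up for illustration; the greedy rule achieves the classic (1 − 1/e) approximation for this setting (Nemhauser, Wolsey, and Fisher, 1978).

```python
def coverage(selected, sets):
    """Monotone submodular objective: number of distinct elements covered."""
    covered = set()
    for i in selected:
        covered |= sets[i]
    return len(covered)

def greedy(sets, k):
    """Pick k sets, each time taking the one with the largest marginal gain."""
    chosen = []
    for _ in range(k):
        best = max((i for i in range(len(sets)) if i not in chosen),
                   key=lambda i: coverage(chosen + [i], sets))
        chosen.append(best)
    return chosen

# Toy instance: greedy first takes the largest set, then the most complementary one.
sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
print(greedy(sets, 2))  # -> [2, 0], covering all 7 elements
```

Diminishing returns is what makes this work: adding a set to a small collection gains at least as much as adding it to a larger one.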

Each problem was:

  • Simple but novel: within reach for a trained grad student.
  • Well-scoped: had a plausible solution pathway.
  • Checked carefully: the authors verified every proof attempt line by line.

The Setup: How the Experiments Worked

Feldman and Karbasi designed five conjectures. Each was inspired by prior work but deliberately crafted to be new.

They gave GPT-5:

  • The problem statement.
  • One or two relevant source papers.
  • Instructions to output a proof in LaTeX format.

Then they carefully assessed the reasoning.

The idea was to see whether GPT-5 could behave like a competent research assistant—able not just to adapt proofs from existing literature but to demonstrate original reasoning where necessary.


GPT-5 vs. Five Conjectures

Here’s how GPT-5 performed on each problem.

Problem 1: Combining Monotone and Non-Monotone Functions

  • Task: Bound the performance of an algorithm maximizing the sum of two types of submodular functions under convex constraints.
  • GPT-5’s Answer: Adapted an existing Frank–Wolfe algorithm proof.
  • Outcome: Nearly correct, but the proof was lazy—skipped steps and clung too closely to prior work.
  • Verdict: Competent but unimaginative.
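
For readers unfamiliar with Frank–Wolfe, here is a toy continuous-greedy-style sketch on a hypothetical smooth concave objective over the probability simplex. It illustrates the core loop (repeatedly move toward the vertex that best aligns with the current gradient), not the paper's actual algorithm or its guarantee.

```python
def grad(x):
    # Gradient of the toy objective F(x) = sum_i (x_i - x_i^2 / 2).
    return [1.0 - xi for xi in x]

def frank_wolfe(n, T=100):
    """Frank-Wolfe-style loop: T steps of size 1/T, starting from the origin."""
    x = [0.0] * n
    for _ in range(T):
        g = grad(x)
        # Linear maximization oracle over the simplex: the best vertex e_j.
        j = max(range(n), key=lambda i: g[i])
        v = [1.0 if i == j else 0.0 for i in range(n)]
        # Step toward that vertex; the final x is an average of vertices,
        # hence a feasible point of the simplex.
        x = [xi + vi / T for xi, vi in zip(x, v)]
    return x

print(frank_wolfe(2))  # -> approximately [0.5, 0.5], the optimum of this toy F
```

Each iteration only needs a linear optimization over the constraint set, which is what makes the method attractive for the convex constraints discussed above.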

Problem 2: Bicriteria Maximization under p-System Constraints

  • Task: Extend bicriteria guarantees from matroids to p-systems (a broader class of constraints).
  • GPT-5’s Answer: Produced a different bound than the authors expected.
  • Outcome: The alternative bound actually refuted the authors’ conjecture while offering a stronger result.
  • Verdict: Genuine originality—a rare spark of insight.

Problem 3: Weakly-DR-Submodular Maximization

  • Task: Work with a relaxation of submodularity using a parameter γ.
  • GPT-5’s Answer: Designed a Frank–Wolfe-style algorithm with a (1 − e^−γ) approximation guarantee.
  • Outcome: Correct in essence but verbose and overly reliant on the cited reference.
  • Verdict: Solid but uninspired.
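
The (1 − e^−γ) guarantee is easy to evaluate numerically. The small helper below (the function name is my own) shows how the bound degrades as γ shrinks and recovers the classic 1 − 1/e guarantee at γ = 1, i.e. for fully submodular functions.

```python
import math

def weak_dr_guarantee(gamma: float) -> float:
    """Approximation guarantee (1 - e^{-gamma}) for weakly-DR-submodular maximization."""
    return 1.0 - math.exp(-gamma)

# gamma = 1 recovers the classic (1 - 1/e) bound for submodular functions.
print(round(weak_dr_guarantee(1.0), 3))  # -> 0.632
# Weaker submodularity (smaller gamma) yields a weaker guarantee.
print(round(weak_dr_guarantee(0.5), 3))  # -> 0.393
```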

Problem 4: Weak Submodularity with Partial Monotonicity

  • Task: Combine two relaxations—weak submodularity and partial monotonicity.
  • GPT-5’s Answer: Initially gave only known results. When pushed, it attempted new ones.
  • Outcome: The proofs were wrong, riddled with subtle but fatal errors. They looked convincing on the surface but fell apart under scrutiny.
  • Verdict: Failure—dangerously plausible but incorrect.

Problem 5: A Harder, Open-Ended Case

  • Task: A problem the authors expected to be easy but that turned out to be harder.
  • GPT-5’s Answer: Suggested the same algorithm as the authors but failed in the proof.
  • Outcome: Neither the humans nor the AI solved it—still open.
  • Verdict: Aligns with human reasoning, but incomplete.

Patterns: Where GPT-5 Shines and Where It Fails

Looking across the five problems, clear trends emerge.

Strengths:

  • Can adapt known proofs and produce LaTeX-ready outputs.
  • Handles straightforward reasoning chains well.
  • Occasionally generates surprising originality (as in Problem 2).

Weaknesses:

  • Struggles when multiple proof techniques must be combined.
  • Proofs often look correct but contain hidden flaws—a major trust issue.
  • Relies heavily on mimicking structures from reference papers.
  • Cannot yet perform integrative reasoning across diverse contexts.

In short, GPT-5 is like a serious but unimaginative graduate student—capable of competence, even the occasional insight, but not yet consistently trustworthy.


How This Fits into the Bigger Picture

This study is part of a broader wave of research probing AI’s ability to advance mathematics.

  • Sébastien Bubeck recently reported GPT-5 making progress on a convex optimization problem by improving a bound from 1/L to 1.5/L.
  • Diez, da Maia, and Nourdin (2025) studied GPT-5 on central limit theorem convergence proofs, showing it could make progress but required careful human oversight.

The common theme: AI shows glimmers of creativity, but its reasoning is still fragile.

What makes the Gödel Test unique is its focus on novelty. It doesn’t matter if the problem is “easy”—what matters is that it’s new. That’s the only way to measure whether AI is truly contributing knowledge rather than recycling it.


Why the Gödel Test Matters for AI Advancement

This paper is important for several reasons.

1. It reframes evaluation.

Current benchmarks test whether AI can match humans on known problems. The Gödel Test asks: Can AI take a step into the unknown?

2. It highlights both promise and peril.

  • GPT-5’s success on Problem 2 shows AI can occasionally outthink its human testers.
  • Its failure on Problem 4 shows how convincing but wrong proofs can be—an enormous risk if deployed without oversight.

3. It connects AI to scientific creativity.

Mathematics is the purest testbed for reasoning. If AI can contribute to it—even in small increments—it opens doors to physics, computer science, and beyond.

4. It sets the stage for new benchmarks.

Imagine a future where AI models are routinely tested not just on coding or exams, but on their ability to solve fresh conjectures in math, physics, or biology.

5. It underscores the need for human-AI collaboration.

AI may not yet be an autonomous mathematician, but it is edging closer to being a valuable research assistant—one that sparks ideas, explores variants, and occasionally uncovers overlooked insights.


Limitations of the Study

The authors are careful to acknowledge limitations:

  • Only five conjectures were tested—a small sample.
  • Only GPT-5 was evaluated—no comparison with Claude, Gemini, or DeepSeek.
  • Checking proofs was time-consuming, limiting scale.
  • Designing conjectures that are both novel and solvable is inherently tricky.

This means results are suggestive, not definitive. Still, they point the way forward.


Looking Ahead: What Comes After the Gödel Test?

The Gödel Test may become a new standard in evaluating AI’s reasoning. But it also highlights the tools that could accelerate progress:

  • Integration with proof assistants (Lean, Coq, Isabelle) for automatic verification.
  • Better prompting to push models toward more self-contained reasoning.
  • Cross-model testing to see whether originality generalizes across architectures.
  • Scaling up: running hundreds of conjectures to build statistical confidence.
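
To give a flavor of what proof-assistant integration buys, here is a one-line Lean 4 theorem. Once Lean's kernel accepts it, the proof is mechanically verified, removing the need for the line-by-line human checking that limited this study's scale.

```lean
-- A machine-checked statement in Lean 4: the kernel verifies the proof term,
-- so no human needs to re-read the argument line by line.
theorem sum_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b
```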

Long-term, passing the Gödel Test could signal that AI is not just a tool for recalling knowledge, but a partner in creating it.


Conclusion: A First Step Into AI Originality

The Gödel Test paper is not claiming that GPT-5 is ready to revolutionize mathematics. But it shows something important:

  • On some problems, AI can reason competently.
  • On others, it can surprise us with originality.
  • And yet, it still often fails convincingly, underscoring the risks.

In that sense, the Gödel Test is more than a benchmark. It’s a philosophical and practical probe into whether AI can take the leap from imitation to creation.

We’re not there yet—but the path is becoming clearer.

📖 Full paper: Gödel Test: Can Large Language Models Solve Easy Conjectures?

Curtis Pyke

A.I. enthusiast with multiple certificates and accreditations from Deep Learning AI, Coursera, and more. I am interested in machine learning, LLMs, and all things AI.

© 2024 Kingy AI