Large Language Models (LLMs) have demonstrated remarkable competence in a variety of tasks, ranging from composing coherent essays to generating viable solutions for intricate problems. However, harnessing their full potential frequently requires strategies that exploit inference-time compute more effectively. In the paper “Evolving Deeper LLM Thinking” by Kuang-Huei Lee et al. (2025), the authors present an evolutionary search framework—termed Mind Evolution—designed to improve LLM performance on a range of challenging natural language planning problems without the need for formal, domain-specific solvers. This framework systematically explores and refines candidate solutions in large, sometimes ill-defined, solution spaces, guided only by an evaluator that can verify solution quality. Crucially, the evaluator need not be a formal solver but merely a function that scores or critiques candidate outputs programmatically.
Below, we provide an extensive summary of the paper. The discussion traverses the motivation behind Mind Evolution, its connections to related work, the technical underpinnings of the evolutionary approach, its experimental evaluation on three major benchmarks (TravelPlanner, Trip Planning, and Meeting Planning) and a newly introduced challenge called StegPoet, as well as an in-depth assessment of how the approach compares against simpler methods and which of its components matter, based on ablation studies. Throughout, we track how Mind Evolution outperforms alternatives such as Best-of-N sampling or sequential revision, scaling more efficiently as inference-time resources grow. We conclude with a look at future directions and the broader significance of an evolutionary paradigm in LLM inference.
1. Motivation and Background
A pervasive question in current LLM research is how to spend inference-time compute effectively to improve problem-solving. Simply sampling multiple responses from an LLM can help (i.e., “Best-of-N”), but in many tasks, each solution candidate is fairly independent of the others. This misses opportunities to refine or recombine good partial ideas. Meanwhile, purely sequential approaches that refine a single solution step by step (e.g., “self-reflection” or single-trajectory “Reflexion”) can get stuck in local modes or require many evaluation calls. A more nuanced approach to search would therefore generate multiple ideas, evaluate them, and systematically refine promising ones.
Enter Mind Evolution: an evolutionary search procedure that iteratively evolves a population of natural-language solution candidates. Drawing inspiration from genetic algorithms, it retains essential features such as:
- Population-Based Search: Maintaining multiple solutions simultaneously.
- Mutation and Crossover: Combining components of promising solutions (crossover) or modifying an individual solution (mutation).
- Selection: Giving preference to high-scoring solutions while preserving diversity.
- Parallelization: Potentially running multiple sub-populations, or “islands,” in parallel.
However, in contrast to classical genetic algorithms that often target formal, well-defined solution spaces (e.g., bit-strings or numeric vectors), Mind Evolution operates purely in natural language. This broadens applicability. Whenever a task’s solutions can be checked by a specialized “solution evaluator,” Mind Evolution can proceed without any formal solver or step-by-step symbolic reward. Indeed, the authors demonstrate that tasks as diverse as multi-day travel planning, multi-party meeting scheduling, or even steganographic text generation can all be tackled by the same evolutionary approach, provided that an evaluator can critique or reject incorrect solutions.
2. Related Work
2.1. Increasing Inference-Time Computation
The use of iterative refinement strategies to strengthen an LLM’s solutions is not new. Past work includes:
- Chain-of-Thought prompting, which coaxes an LLM to break down its reasoning steps.
- Self-Consistency, which samples multiple chain-of-thought paths and aggregates the answers.
- Verification-Assisted Revisions, in which an LLM checks its intermediate or final outputs, sometimes using a learned verifier.
However, these are mostly single-lineage expansions of solutions: the model “thinks aloud” or tries multiple independent thoughts that do not systematically “cross-fertilize.” That is, each chain-of-thought is not recombined with other chains in a structured manner.
2.2. Evolutionary Methods and LLMs
Evolutionary strategies paired with LLMs have appeared in multiple contexts. Several efforts exploit program synthesis or code generation with “execution-based feedback,” searching for code that passes certain tests. Yet these frameworks typically:
- Focus on formal code spaces, leveraging an environment that can compile and run the code.
- Rely on direct semantic checks (e.g., whether code produces the right output).
By contrast, Mind Evolution does not require formal code generation. Any natural-language representation of a candidate solution is permissible, and the system only needs an evaluator function to check correctness. This expands the scope to tasks like itinerary planning or puzzle-like problems that do not have straightforward formal definitions but do have a means of verifying correctness (e.g., a program that checks constraints on budget, scheduling, or message encoding).
2.3. Pairing LLMs with Execution or Evaluation Feedback
In code-related tasks, the notion of a “test suite as feedback loop” is well known. For more open-ended problems, some have trained neural verifiers or used LLM-based self-evaluation. But whenever such verifiers are imperfect, the feedback is less reliable, which can hamper performance. The authors emphasize a scenario where the feedback is unambiguous and programmatically verified. They demonstrate that, within domains that can be checked automatically, no explicit formal solver or translation pipeline is necessary. Mind Evolution thus avoids the complexities of translator-based or specialized solver-based pipelines (see, for example, prior work that uses GPT-4 to translate natural-language constraints into a formal planning instance and then calls a specialized solver).
3. Core Algorithmic Design: Mind Evolution
The authors frame Mind Evolution in the lineage of genetic algorithms (GA). Each candidate solution—expressed in unstructured natural language—forms an “individual” in the population. This population undergoes repeated cycles of the following steps (a minimal sketch of the loop follows the list):
- Evaluation: Each solution’s fitness is computed by a domain-specific function that returns both a numerical score and textual feedback.
- Selection: Based on these scores, a subset of solutions is chosen to be parents for the next generation.
- Recombination (mutation & crossover): The LLM is prompted to produce new candidate solutions by refining and merging parent solutions.
- Population update: New candidates are added, possibly removing duplicates or worse-performing solutions, often with an “island model” structure.
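To make the cycle concrete, here is a minimal Python sketch of the outer loop. The helper callables (`init`, `evaluate`, `select`, `recombine`) and the parameter defaults are illustrative stand-ins for the paper’s prompts and hyperparameters, not its actual implementation:

```python
from typing import Callable, List, Optional, Tuple

Candidate = str  # a candidate solution is just free-form natural-language text

def evolve(
    init: Callable[[int], List[Candidate]],              # samples the initial population
    evaluate: Callable[[Candidate], Tuple[float, str]],  # returns (score, textual feedback)
    select: Callable[[list], List[list]],                # picks groups of scored parents
    recombine: Callable[[list], Candidate],              # LLM-driven crossover/mutation of one group
    generations: int = 10,
    population_size: int = 40,
    target_score: float = 0.0,                           # assume 0 means "all constraints satisfied"
) -> Candidate:
    """Outer loop only; every callable wraps an LLM prompt or the task's programmatic evaluator."""
    population = init(population_size)
    best: Optional[Tuple[Candidate, float, str]] = None
    for _ in range(generations):
        scored = [(c, *evaluate(c)) for c in population]             # Evaluation
        top = max(scored, key=lambda t: t[1])
        if best is None or top[1] > best[1]:
            best = top
        if best[1] >= target_score:                                  # early exit on a fully valid solution
            break
        parent_groups = select(scored)                               # Selection
        population = [recombine(group) for group in parent_groups]   # Recombination + population update
    return best[0]
```

Everything the LLM actually does is hidden inside `init`, `select`, and `recombine`; the loop itself is ordinary genetic-algorithm bookkeeping.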
3.1. Fitness Evaluation
Crucially, Mind Evolution depends on a custom function that can parse a candidate solution and identify how well it satisfies domain constraints. For example, in a trip-planning scenario:
- The function checks if the solution’s total cost is under budget.
- The function verifies whether the user’s preference for specific dining cuisines is honored.
- If constraints fail, it indicates these shortfalls via textual feedback.
Though the logic to parse and evaluate a plan can be intricate, it is often more straightforward than engineering a solver from scratch. Once such an evaluator is coded, no domain-specific modeling is needed: Mind Evolution’s LLM-based prompts handle the rest.
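As an illustration, a toy trip-planning evaluator might look like the sketch below; the plan fields, penalty weights, and the convention that a score of 0 means “all constraints satisfied” are our assumptions, not the paper’s exact scoring rules:

```python
def evaluate_trip_plan(plan: dict, budget: float, required_cuisines: set) -> tuple:
    """Toy evaluator returning (score, feedback); field names and weights are illustrative."""
    penalties = 0.0
    issues = []

    total_cost = sum(day.get("cost", 0.0) for day in plan.get("days", []))
    if total_cost > budget:
        penalties += 1.0
        issues.append(f"Total cost {total_cost:.2f} exceeds the budget of {budget:.2f}.")

    cuisines_served = {meal.get("cuisine")
                       for day in plan.get("days", [])
                       for meal in day.get("meals", [])}
    missing = required_cuisines - cuisines_served
    if missing:
        penalties += 1.0
        issues.append("Missing required cuisines: " + ", ".join(sorted(missing)) + ".")

    score = -penalties  # 0 means every checked constraint is satisfied
    feedback = " ".join(issues) if issues else "All checked constraints are satisfied."
    return score, feedback
```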
3.2. Population Initialization
For each problem instance (e.g., “Plan a 5-day trip from Seattle to Los Angeles, ensuring at least one Japanese meal, etc.”), Mind Evolution creates an initial population by independently sampling several solutions from the LLM. Each solution is then refined for a small number of “turns” (often denoted $\nu_{\mathrm{seq}}$), using the Refinement through Critical Conversation (RCC) approach. This procedure seeds a pool of candidate plans or schedules.
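Initialization is thus independent sampling followed by a few refinement turns per candidate; a small sketch with hypothetical `sample` and `refine_turn` callables (the latter corresponding to one RCC turn, described next):

```python
def init_population(task, sample, refine_turn, population_size=40, num_turns=2):
    """Sample independent drafts, then refine each for a few turns (num_turns plays the role of nu_seq)."""
    population = []
    for _ in range(population_size):
        candidate = sample(task)                      # one independent LLM draft
        for _ in range(num_turns):
            candidate = refine_turn(task, candidate)  # one RCC refinement turn (see next subsection)
        population.append(candidate)
    return population
```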
3.3. Refinement through Critical Conversation (RCC)
A novel twist is the use of two “characters” orchestrated via prompt engineering:
- Critic: Analyzes the textual feedback from the evaluator, enumerates issues in the solution, and suggests how to fix them.
- Author: Proposes an updated solution that addresses the Critic’s concerns.
At each turn, the Critic first “thinks” about the known issues, then the Author revises or merges solutions accordingly. This two-step pattern intentionally fosters deeper iterative reasoning (an approach reminiscent of “self-reflection”, yet carefully structured and seeded with explicit evaluation feedback). The paper notes that textual feedback from the evaluator is vital: it highlights exactly which constraints fail, giving the Critic direction. In an ablation study, removing the Critic role or removing the textual feedback each significantly degrades performance.
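A single RCC turn can be sketched as two LLM calls with different role prompts; the prompt wording below is ours and purely illustrative, assuming a generic `llm(prompt) -> str` callable:

```python
def rcc_turn(llm, task_description, candidate, evaluator_feedback):
    """One Refinement-through-Critical-Conversation turn; prompt wording is illustrative."""
    critic_prompt = (
        f"Task:\n{task_description}\n\n"
        f"Current solution:\n{candidate}\n\n"
        f"Evaluator feedback:\n{evaluator_feedback}\n\n"
        "You are a critic. List every issue with the solution and suggest concrete fixes."
    )
    critique = llm(critic_prompt)

    author_prompt = (
        f"Task:\n{task_description}\n\n"
        f"Previous solution:\n{candidate}\n\n"
        f"Critique:\n{critique}\n\n"
        "You are the author. Write a revised solution that addresses every point in the critique."
    )
    return llm(author_prompt)
```

The key property is that the Critic sees the evaluator’s textual feedback verbatim, so the Author’s revision is anchored to concrete constraint violations rather than to the model’s own guesses.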
3.4. Selection
Mind Evolution then selects solutions in a manner akin to a “Boltzmann tournament”: solutions with higher fitness are more likely to be chosen, but selection is not deterministic, which preserves some diversity. Each recombination conversation can draw on up to a certain number ($\nu_{\mathrm{parent}}$) of solutions as “parents,” which are given to the LLM to produce the next generation.
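A common way to realize this kind of Boltzmann-style selection is to sample parents with probability proportional to exp(score / temperature); the sketch below is one such variant, with the temperature and group sizes as illustrative knobs rather than the paper’s settings:

```python
import math
import random

def boltzmann_select(scored, num_groups, parents_per_group, temperature=1.0):
    """scored: list of (candidate, score, feedback); returns num_groups groups of sampled parents."""
    weights = [math.exp(score / temperature) for _, score, _ in scored]
    return [random.choices(scored, weights=weights, k=parents_per_group)
            for _ in range(num_groups)]
```

Lower temperatures make selection greedier; higher temperatures keep more low-scoring candidates in play and thus more diversity.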
3.5. Crossover and Mutation via LLM Recombination
In typical genetic algorithms, “crossover” merges segments of the parent solutions, while “mutation” changes parts randomly. In Mind Evolution, the LLM is simply prompted to incorporate and refine multiple parent solutions into one child. The RCC process (above) handles the merging: the Critic synthesizes the feedback and the Author attempts to produce a coherent child solution that ideally surpasses each parent in correctness. Because LLMs are strong at paraphrase and textual generation, the authors refer to this entire step (merging parent solutions and iteratively altering them) as “recombination.”
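Concretely, crossover can be as simple as listing several parents (with their evaluator feedback) in one prompt and asking for a merged child; the prompt wording below is an illustrative sketch:

```python
def recombine_parents(llm, task_description, parents):
    """parents: list of (candidate, score, feedback) tuples; prompt wording is illustrative."""
    parent_block = "\n\n".join(
        f"Parent solution {i + 1} (score {score}):\n{candidate}\nEvaluator feedback:\n{feedback}"
        for i, (candidate, score, feedback) in enumerate(parents)
    )
    prompt = (
        f"Task:\n{task_description}\n\n"
        f"{parent_block}\n\n"
        "Combine the strongest parts of these solutions into one new solution "
        "that also fixes the issues mentioned in the feedback."
    )
    return llm(prompt)
```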
3.6. The Island Model
To boost diversity and parallelism, Mind Evolution employs an island model. Multiple sub-populations (“islands”) evolve independently, except that every few generations:
- Migration: A few top solutions from island $i$ are copied to island $i+1$.
- Island Reset: After a prescribed interval, the system identifies the worst-performing islands, retires their populations, and repopulates them using top solutions from the best global pool. Optionally, the LLM is invoked to filter or select a diverse subset from top candidates so that the new populations do not become too homogeneous.
As shown in ablation experiments, these resets and migrations help the algorithm avoid local maxima, leading to better success rates.
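The island bookkeeping itself is ordinary list manipulation; a minimal sketch, assuming each island is a list of (candidate, score, feedback) tuples and adopting a ring migration topology and a simple reset rule as illustrative choices:

```python
def migrate(islands, num_emigrants=2):
    """Copy each island's top candidates to the next island in a ring; entries are (cand, score, fb)."""
    snapshots = [sorted(island, key=lambda t: t[1], reverse=True)[:num_emigrants]
                 for island in islands]                       # snapshot before mutating anything
    for i, emigrants in enumerate(snapshots):
        islands[(i + 1) % len(islands)].extend(emigrants)

def reset_worst_islands(islands, global_top, num_resets=1):
    """Retire the lowest-scoring islands and reseed them from a globally elite pool."""
    ranked = sorted(range(len(islands)), key=lambda i: max(t[1] for t in islands[i]))
    for idx in ranked[:num_resets]:
        islands[idx] = list(global_top)
```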

4. Empirical Evaluation on Natural Language Planning
The central claim is that Mind Evolution can tackle tasks purely from natural language instructions, as long as the solutions can be unambiguously checked. The authors test it on three established benchmarks plus a new one:
- TravelPlanner: Plans multi-day itineraries with specified budgets, restaurants, accommodations, and constraints.
- Natural Plan: Trip Planning: Creates travel sequences across multiple cities with flight connectivity constraints, day allocations, and special events.
- Natural Plan: Meeting Planning: Schedules meetings with multiple individuals under location, travel time, and availability constraints.
- StegPoet (newly introduced): Hides a numeric message in a piece of creative writing (such as a poem), verifying that the correct words appear in the correct order, and that constraints on style and spacing are met.
In each domain, a specialized evaluation function checks solutions for constraint violations or suboptimal objective values, returning a numeric penalty score plus textual feedback. For instance, in Meeting Planning, the score might penalize unsatisfied constraints (like failing to meet a friend during their available window), and the text feedback pinpoints exactly what was missing or contradictory.
4.1. Baselines
The authors systematically compare Mind Evolution with:
- 1-Pass: A single forward pass from the LLM.
- Best-of-N: Up to $N$ independent solutions are sampled, returning the best among them that the evaluator finds correct.
- Sequential-Revision+: The LLM refines each of several initially sampled solutions for many (up to 80) turns, treating them independently (akin to repeated self-reflection).
They keep track of the total cost in LLM calls and tokens consumed, as well as the fraction of tasks successfully solved (the “Success Rate”). They use these strategies across all tasks, controlling for a maximum of 800 solutions (or 800 calls, depending on the method) to ensure comparable compute consumption.
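For comparison, the two sampling baselines reduce to very short loops; the sketch below reuses the same (score, feedback) evaluator convention as earlier, with parameter values that are illustrative rather than the paper’s exact settings:

```python
def best_of_n(sample, evaluate, n=800):
    """Draw up to n independent candidates and return the best-scoring one found."""
    best = None
    for _ in range(n):
        candidate = sample()
        score, _ = evaluate(candidate)
        if best is None or score > best[1]:
            best = (candidate, score)
        if score >= 0:                      # assuming 0 means "all constraints satisfied"
            break
    return best[0]

def sequential_revision(sample, evaluate, refine, num_seeds=10, max_turns=80):
    """Independently revise each seed for up to max_turns, keeping the best candidate overall."""
    best = None
    for _ in range(num_seeds):
        candidate = sample()
        for _ in range(max_turns):
            score, feedback = evaluate(candidate)
            if best is None or score > best[1]:
                best = (candidate, score)
            if score >= 0:
                return candidate
            candidate = refine(candidate, feedback)
    return best[0]
```

Best-of-N never reuses information across samples, and Sequential-Revision+ never shares information across seeds; Mind Evolution’s advantage comes from combining both axes of search.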
4.2. Results on TravelPlanner
TravelPlanner tasks range from 3-day “easy” constraints to 7-day “hard” constraints. The authors note that simpler approaches struggle due to the numerous implicit commonsense constraints (like not booking the same accommodation multiple times if the constraints forbid it, or ensuring day-by-day coherence). The results show:
- 1-Pass with Gemini 1.5 Flash solves only 5.6% of the validation set.
- Best-of-N obtains 55.6% success (with an allowance of up to 800 samples).
- Sequential-Revision+ can raise it to 82.8%.
- Mind Evolution yields a remarkable 95.6% success on validation.
By systematically refining a population of solutions, Mind Evolution can detect constraints that Best-of-N might fail to address effectively. The authors also highlight a two-stage approach that escalates to a more powerful LLM (Gemini 1.5 Pro) if Mind Evolution with Gemini 1.5 Flash still fails after a set budget. This two-stage pipeline virtually solves 100% of validation tasks and 99.9% of test tasks, rivaling or exceeding the performance of a specialized formal solver approach from prior work. Importantly, Mind Evolution requires no domain-specific formal solver—merely a textual evaluator.
4.3. Results on Natural Plan – Trip Planning
This dataset expects a sequence of city visits with constraints like flight connections and special events that must happen on specific days. The difficulty scales by the number of cities: from 3 to 10. The authors split the tasks into 320 validation and 1,280 test examples.
- 1-Pass gets about 20.6% (Flash).
- Best-of-N improves significantly to 77.2% on the validation set.
- Sequential-Revision+ gets 74.4%.
- Mind Evolution scores 96.2% on validation and 94.1% on test.
Again, the more powerful two-stage method combined with Gemini 1.5 Pro solves effectively all instances (100% on validation, 99.6% on test). The authors comment that Best-of-N does fairly well in this domain because the constraints are relatively more straightforward (flight connectivity, day allocations) and do not rely on obscure commonsense knowledge. Still, Mind Evolution further refines solutions, especially for the larger city counts.

4.4. Results on Natural Plan – Meeting Planning
Scheduling a series of meetings is conceptually simpler than multi-day or multi-city trips but includes the objective of meeting as many friends as possible. That objective complicates single-pass solutions: deciding which subset of meetings is feasible while not violating time and location constraints requires nuanced exploration.
In summary:
- 1-Pass obtains a 20.8% success rate on validation (with Gemini 1.5 Flash), while a 1-Pass run with the stronger OpenAI o1-preview baseline reaches 44.2%.
- Best-of-N yields 69.4%.
- Sequential-Revision+ only reaches 62.0%.
- Mind Evolution hits 85.0% on validation and 83.8% on test.
Once again, a two-stage approach pushing unsolved tasks to a bigger model yields near-comprehensive coverage (98.4% and 98.2% success). Critically, the authors note that this domain often cannot be trivially enumerated (like some simpler calendar scheduling might). The constraints can be subtle, and the best plan might skip certain individuals or reorder the location visits. Mind Evolution’s broad and deep search systematically outperforms simpler multi-sample or single-lineage revision methods.
4.5. Ablations and Scaling Behavior
The authors conduct controlled studies on TravelPlanner and the two Natural Plan tasks to isolate the effects of various design decisions. Key findings:
- Role of Textual Feedback: Removing textual feedback from the evaluator, or not including a separate “Critic” role in the conversation, substantially drops success rates. The synergy between explicit critiques and an LLM that can fix them is crucial.
- Island Model: Disabling the island model (i.e., using a single population) reduces final success, suggesting that multiple sub-populations are important for maintaining diversity and parallel search.
- Reset with LLM: In the Island Reset step, having the LLM pick top solutions with diversity in mind outperforms a purely numeric selection.
- Number of Generations vs. Candidates: The authors show that distributing 800 candidate solutions across 10 generations of 80 new solutions per generation is better than a single generation with 800 solutions (or a single chain of 800 revisions). The repeated refine-and-select cycle is evidently beneficial.
In general, Mind Evolution’s performance scales well as more generations or more total solutions are allowed. The curves of success rate vs. computational cost (measured in LLM token usage) show that Mind Evolution typically dominates Best-of-N or sequential refinement in cost-efficiency.
5. The StegPoet Task: Encoding Messages in Poetry
To underscore the broad applicability of Mind Evolution, the authors propose a novel test called StegPoet. The user has a hidden numeric message—for example, a list of integers from 10 to 100—that must be mapped to unique “cipher words” inserted in a piece of creative text (e.g., a poem “in the style of Shel Silverstein”). Additional constraints include the following (a toy checker sketch appears after the list):
- The poem must place each cipher word in the correct order, spaced by a certain average number of words ($\beta$) so that the text is not just a blatant word list.
- The cipher words must not appear anywhere else nor be repeated incorrectly.
- The overall text must be “good” writing that fits a specified style or topic.
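A toy checker for the ordering and spacing constraints is easy to sketch; the mapping format, the tolerance on spacing, and the decision to ignore stylistic quality and word reuse are illustrative simplifications of the paper’s evaluator:

```python
def check_stegpoet(poem: str, message: list, cipher: dict, target_spacing: int, tolerance: int = 3):
    """Return (ok, feedback); cipher maps each number in `message` to its code word."""
    words = poem.lower().split()
    code_words = [cipher[num].lower() for num in message]
    positions, cursor = [], 0
    for word in code_words:
        try:
            idx = words.index(word, cursor)   # each code word must appear after the previous one
        except ValueError:
            return False, f"Code word '{word}' is missing or appears out of order."
        positions.append(idx)
        cursor = idx + 1
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if gaps:
        avg_gap = sum(gaps) / len(gaps)
        if abs(avg_gap - target_spacing) > tolerance:
            return False, f"Average spacing {avg_gap:.1f} deviates from the target of {target_spacing}."
    return True, "Message embedded correctly."
```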
Though this task might be “silly” at first glance, it is nontrivial. The LLM must devise a consistent mapping from numbers to words and embed those words in a poem of natural style, all while respecting the spacing constraints. A specialized solution would presumably require building a custom steganographic generator. Instead, the authors rely on Mind Evolution and an evaluator that checks the correctness of the mapping. The results:
- 1-Pass cannot solve any instance, as the likelihood of correctly embedding all constraints is extremely low.
- Best-of-N rarely finds a correct solution (only 1% success).
- Sequential-Revision+ manages about 19.8%.
- Mind Evolution jumps to 46.5% success.
- With a two-stage approach, hooking unsolved cases to Gemini 1.5 Pro, success soars to 87%.
The complexity arises from the synergy of textual style constraints, numeric ordering constraints, and the requirement that repeated numbers appear in the poem with correct spacing. This example strongly demonstrates that Mind Evolution’s general mechanism—stochastic exploration plus iterative refinement based on an evaluator—supports tasks well beyond the usual “planning” realm, covering any domain where a verifying function can parse the output.

6. Analysis, Limitations, and Future Directions
The authors highlight several overarching lessons:
- Search in Natural Language: Even without a formal grammar or domain model, the iterative refine-and-critique approach works as long as the final answers can be parsed and checked. This reduces the need to build specialized solvers or formal translations.
- Broad+Deep Approach: Sampling a broad set of solutions (like Best-of-N) helps gather diverse ideas. Iteratively refining them (like reflexive methods) helps correct mistakes. By combining both in an evolutionary setting, Mind Evolution harnesses the benefits of breadth and depth.
- High Reliability: On tasks like TravelPlanner or Trip Planning, the method solves nearly all instances if given enough attempts, especially when “falling back” to a more capable LLM on unsolved cases. This means one can cost-optimize by using a cheaper LLM first, then escalating only when necessary (a minimal sketch of this fallback follows the list).
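A minimal sketch of that fallback strategy, assuming a hypothetical `evolve_with(task, model=...)` wrapper around the full evolutionary search that reports whether the task was solved (attribute and parameter names are ours):

```python
def two_stage_solve(tasks, evolve_with, cheap_model="gemini-1.5-flash", strong_model="gemini-1.5-pro"):
    """Run the cheap model first and escalate only unsolved tasks (attribute names are illustrative)."""
    solutions, unsolved = {}, []
    for task in tasks:
        result = evolve_with(task, model=cheap_model)   # full Mind Evolution run with the cheaper LLM
        if result.solved:
            solutions[task.id] = result.solution
        else:
            unsolved.append(task)
    for task in unsolved:
        result = evolve_with(task, model=strong_model)  # second stage: retry with the stronger LLM
        solutions[task.id] = result.solution            # keep the best attempt even if still imperfect
    return solutions
```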
Nevertheless, the paper acknowledges key limitations:
- The method hinges on the availability of a deterministic or near-deterministic “solution evaluator.” If no such programmatic method exists, the approach must rely on approximate or learned evaluators, which can be noisy or misaligned.
- The technique can be more expensive than simpler single-pass generation. Each solution might require multiple calls to the LLM, especially for multi-turn refinement steps. However, as results show, the overhead is often offset by a much higher solve rate.
- The approach is tested on tasks with straightforward ways to parse solutions (through JSON format or an annotated text format). For tasks with more ambiguous output formatting, one must design a robust parser or rely on the model to adhere to a strict template.
7. Conclusion
Mind Evolution elevates inference-time search from either naive broad sampling (Best-of-N) or purely linear iterative refinement (Reflexion-like approaches) to a population-based evolutionary framework. By leveraging the synergy of parallel candidate exploration, stepwise solution refinement, and selection guided by an automatic evaluator, the method achieves substantially higher success rates in a fraction of the calls or tokens that simpler strategies would require for comparable performance.
Its extensive success in TravelPlanner and Natural Plan tasks—areas in which prior attempts often required specialized solvers or ended in partial solutions—shows the strength of search in natural-language solution spaces. Furthermore, the introduction of the StegPoet challenge underscores that Mind Evolution is not bound to classical planning tasks alone; any domain with a stable method to parse, score, and critique solutions can benefit.
From a broader perspective, evolutionary algorithms share with human creativity the interplay of “divergent” and “convergent” thinking. Mind Evolution operationalizes this concept: it diverges by sampling and combining varied solutions, then converges by selecting and refining better ones. Such an approach resonates with the psychology of problem solving and suggests a new direction for harnessing large language models more effectively, even (or especially) when domain constraints are too complicated to encode in a formal solver.