1. Introduction and Motivations
Generative AI systems today often default to straightforwardly presenting information rather than conducting dialogues that promote deep learner engagement. Recognizing that effective tutoring transcends mere fact-delivery, the Google LearnLM team embarked on a project to imbue the emergent Gemini model family with more robust pedagogical capabilities—capabilities reminiscent of a skillful human tutor. Their central premise: instead of fixating on one monolithic definition of “pedagogy,” developers, teachers, and end-users should be able to specify what “good pedagogy” means in each context, and the model should adapt accordingly. This dynamic perspective is particularly germane to educational technology, which must accommodate diverse grade levels, languages, cultural contexts, and philosophical stances.
In their previous report, the LearnLM team (Jurenka et al., 2024) delineated a broad vision for generative AI in education, outlining potential benefits, risks, and overarching philosophies for responsible deployment. That earlier work acknowledged the obstacles involved in developing a single “ideal” tutor-like AI: cost, maintenance, ever-evolving base models, and the fundamental challenge of bridging multiple educational styles across disciplines and cultures. This new paper, titled “LearnLM: Improving Gemini for Learning” (arXiv:2412.16429), concentrates on a refined approach: pedagogical instruction following, harnessing system-level instructions to cultivate learning-focused conversation behaviors.
A key shift since the earlier publication is that the new methodology eschews dogmatic definitions of how an AI tutor “must” behave—no single canonical approach is mandated. Instead, system instructions describe the desired interactive tutoring style, encouraging the model to adopt a flexible, context-tailored teaching persona. By blending these specialized pedagogical data into Gemini’s post-training mixture, researchers aim to improve the next generation of multimodal, multi-turn large language models for education. Indeed, the authors argue that learning behaviors are often in tension with standard open-domain generative capabilities, but they show how an instruction-following approach can reconcile these two objectives by ensuring the model does not “forget” core reasoning capabilities, factual knowledge, or safety constraints.
Through large-scale evaluations, the new LearnLM model, based on Gemini 1.5 Pro (specifically gemini-1.5-pro-002; see Gemini Team et al., 2024), obtains consistently high ratings from expert pedagogy raters across varied tutoring scenarios. On metrics including tutor adaptivity, student engagement, correctness, and style, LearnLM exhibits a 31% preference over GPT-4o, an 11% preference over Claude 3.5, and a 13% preference over its own base, Gemini 1.5 Pro. The authors emphasize that these preference ratings reflect an October 2024 snapshot of a rapidly shifting landscape, as GPT-4o and Claude 3.5 also receive periodic upgrades. Nonetheless, the results confirm that a targeted infusion of pedagogical data—backed by system instructions, conversation scenarios, and Reinforcement Learning from Human Feedback (RLHF)—can markedly advance the model’s teaching proficiency.
2. Pedagogical Instruction Following
2.1. Conceptual Foundations
Instruction following (IF) is a core mechanism for aligning large language models with user intentions. Historically, generative AI systems responded in open-ended ways to user questions or instructions, lacking the capacity to differentiate between “hard constraints” (e.g., “Stay within 100 words” or “Do not mention the correct solution yet.”) and “soft constraints” (e.g., “Offer a motivating, encouraging tone.”). In educational settings, these constraints can be intricate, mutually contradictory, and variable across contexts: a single teacher might want the AI to encourage Socratic questioning in one classroom, while a tutoring-app developer might prefer minimal direct hints in another.
Gemini (see arXiv:2403.05530) explicitly splits instructions into user instructions and system instructions, where system instructions are high-priority directives that take precedence over user-level requests and can be extremely detailed, specifying everything from persona styling to domain restrictions. The LearnLM approach extends that concept: it provides “pedagogical system instructions” that carefully lay out the tutoring style. Because not all tutoring constraints are easily verifiable, reliance on purely programmatic checks or purely soft instructions (e.g., “Use a warm tone.”) is insufficient. By collecting training examples seeded with these system instructions, the team ensures the model learns to interpret, prioritize, and consistently follow them.
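To make the mechanics concrete, here is a minimal sketch of supplying a pedagogical system instruction through the google-generativeai Python SDK. The instruction text is illustrative rather than taken from the paper, and the model id is an assumption (an experimental LearnLM variant was exposed on Google AI Studio; any Gemini-family model that accepts system instructions would be called the same way):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Illustrative pedagogical system instruction, not quoted from the paper.
PEDAGOGY_INSTRUCTIONS = """\
You are a patient tutor for high-school geometry.
- Do not reveal the full solution until the student has attempted the problem.
- Ask one guiding question at a time and keep each reply under 100 words.
- Stay on topic; gently redirect the student if they drift to unrelated subjects.
"""

model = genai.GenerativeModel(
    model_name="learnlm-1.5-pro-experimental",  # assumed model id; substitute any available one
    system_instruction=PEDAGOGY_INSTRUCTIONS,
)

chat = model.start_chat()
reply = chat.send_message("Can you just tell me the answer to problem 3?")
print(reply.text)  # expected: a hint or guiding question, not the answer
```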
2.2. Upgraded Post-Training With RLHF
In the original LearnLM tech report, the authors used straightforward Supervised Fine-Tuning (SFT) on a curated corpus of human-written and synthetic tutoring dialogues. But the second iteration significantly expands upon that approach. The updated pipeline includes:
- Revised SFT Data: All training examples begin with distinct “pedagogical system instructions,” describing desired tutoring methods, constraints, or persona details for that conversation. By having the model see a wide range of system-level instructions, it becomes more robust in following whichever subset of instructions a developer might supply.
- Human Preference Data & Reward Models: The authors collected preference judgments—where raters compared different responses to the same queries while also referencing system instructions. These judgments were used to train specialized reward models (RMs).
- Reinforcement Learning (RL): Instead of stopping at SFT, the authors apply RL using the newly trained RMs. This step is especially potent for fine-grained improvements: the model not only sees example dialogues but also internalizes the subtle distinctions raters make about the extent to which the system truly respects the system instructions. (A generic sketch of the pairwise preference loss typically used to train such RMs appears after this list.)
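The paper does not publish its reward-model training code, but the preference-learning step it describes is conventionally implemented with a pairwise, Bradley-Terry-style loss in the spirit of Ziegler et al. (2019). The PyTorch sketch below uses a stand-in linear “reward model” and random feature vectors purely for illustration:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    """Push the reward of the rater-preferred response above the dispreferred one."""
    r_chosen = reward_model(chosen)      # scalar reward per preferred response
    r_rejected = reward_model(rejected)  # scalar reward per dispreferred response
    # -log sigmoid(r_chosen - r_rejected) is minimized when chosen outscores rejected.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: a linear scorer over 16-dim stand-ins for encoded
# (system instructions + dialogue + candidate response) inputs.
rm = torch.nn.Linear(16, 1)
chosen, rejected = torch.randn(4, 16), torch.randn(4, 16)
loss = preference_loss(rm, chosen, rejected)
loss.backward()
```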
Moreover, LearnLM is no longer produced through “post-post-training” in isolation (i.e., fine-tuning a model that has already completed post-training). Instead, it is co-trained alongside Gemini’s normal mix of SFT, RM training, and RL: LearnLM data is injected into the main pipeline, ensuring that the model retains and even expands upon its existing reasoning, factual, and multimodal capabilities. This holistic approach keeps LearnLM in sync with base Gemini improvements rather than letting it drift away as Gemini evolves, as the toy data-mixture sketch below illustrates.
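At the data level, co-training reduces to sampling pedagogical examples alongside the general post-training mixture rather than fine-tuning on them afterward. The 10% weight below is invented for illustration; the paper does not disclose its mixture ratios:

```python
import random

def mixed_batches(general_data, pedagogy_data, pedagogy_weight=0.1):
    """Yield training examples, drawing from the pedagogy set with the given probability."""
    while True:
        pool = pedagogy_data if random.random() < pedagogy_weight else general_data
        yield random.choice(pool)

stream = mixed_batches(["general example"], ["pedagogy example"])
print(next(stream))
```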
3. Scenario-Based Human Evaluation
3.1. Why Scenarios?
Evaluating an AI tutor is qualitatively different from measuring standard language modeling perplexity or single-turn question answering metrics. The authors stress that true educational dialogues can meander or switch complexity levels based on learner confusion or curiosity. Hence, LearnLM employs a scenario-based evaluation pipeline. Each scenario is a structured, multi-turn environment with:
- A specified learner persona (e.g., “A distracted tenth-grader who wants to skip steps.”).
- A learning goal (“Complete a geometry homework on circle theorems.”).
- Grounding materials (like an essay excerpt or an image of a homework question).
- System instructions that declare the style of tutoring, such as “Use only hints; do not reveal solutions unless the student attempts the problem,” or “Maintain a positive, encouraging tone but do not deviate from the subject matter if the student tries to talk about video games.”
Scenarios make model comparisons consistent and fair: each model sees the same initial user query, the same system instructions, and the same grounding context. Human participants role-play the learner’s side of the conversation in a carefully guided manner so that expert raters can assess each system’s quality under near-identical conditions.
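The paper describes these four scenario components but not a concrete schema; as a sketch, a scenario template could be represented with a simple dataclass whose field names are invented here for illustration:

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    learner_persona: str      # e.g. "A distracted tenth-grader who wants to skip steps."
    learning_goal: str        # e.g. "Complete a geometry homework on circle theorems."
    grounding: list[str]      # essay excerpts, homework images, etc.
    system_instructions: str  # the tutoring style every model must follow
    initial_query: str        # identical first learner message for every model

geometry = Scenario(
    learner_persona="A distracted tenth-grader who wants to skip steps.",
    learning_goal="Complete a geometry homework on circle theorems.",
    grounding=["homework_q3.png"],
    system_instructions=("Use only hints; do not reveal solutions "
                         "unless the student attempts the problem."),
    initial_query="Can you just give me the answers to my circle theorems homework?",
)
```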
3.2. Gathering Conversations
The evaluation pipeline collects actual dialogues in three steps:
- Scenario Curation: The team created a bank of 49 scenario templates across subjects ranging from high school algebra, computer science, and history to adult-level continuing education domains. These were derived from real-world use cases shared by teachers, non-profits, educational institutions, and internal Google product teams.
- Conversation Collection: For each scenario, a pool of 186 pedagogy experts (each with advanced academic degrees and tutoring experience) was instructed to role-play the indicated learner persona. They used the scenario’s initial query, system instructions, and grounding materials to interact with two different models (e.g., LearnLM vs. GPT-4o) in random order. They conversed for at least 10 turns.
- Conversation Rating: A separate pool of 248 pedagogy experts then read these dialogue transcripts. For each transcript pair (two different models, same scenario), the raters answered both single-model Likert questions (e.g., “This tutor helped the student discover their own mistakes.”) and comparative preference questions (e.g., “Which tutor better adhered to system instructions?”).
The resulting dataset comprised 2,360 conversations containing 58,459 messages (learner and model combined), plus 10,192 expert ratings.
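As a first-pass descriptive statistic (the paper’s actual analysis is the Bayesian hierarchical model described in Section 3.3), paired comparative ratings can be summarized as the net fraction of comparisons won by one model over the other. The outcomes below are hypothetical:

```python
def net_preference(outcomes):
    """outcomes: one of 'A', 'B', or 'tie' per side-by-side comparison."""
    wins_a = outcomes.count("A")
    wins_b = outcomes.count("B")
    return (wins_a - wins_b) / len(outcomes)

# 3 wins for A, 1 for B, 1 tie -> net preference of 0.4 (40%) for model A.
print(net_preference(["A", "A", "tie", "B", "A"]))
```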
3.3. Pedagogy Rubric and Comparison Measures
The authors built upon a rigorous set of metrics reflecting key teaching principles, grouped under “Manages cognitive load,” “Inspires active learning,” “Deepens metacognition,” “Stimulates curiosity,” and “Adapts to learner.” For instance:
- Manages cognitive load: The tutor should avoid extraneous detail, present information in smaller digestible chunks, and maintain clarity.
- Inspires active learning: The tutor should encourage student attempts, pose questions, avoid simply giving away solutions, and cultivate constructive problem-solving.
- Deepens metacognition: The tutor should identify and discuss errors, highlight partial correctness, and adapt the conversation so that the student can discover mistakes.
Additionally, rater surveys probed whether the tutor was “warm,” “competent,” or “encouraged interest in the topic.” Another dimension of evaluation was the side-by-side rating: “Which tutor is more like a good human tutor?” or “Which tutor better supports the student’s learning goal?”
The authors used Bayesian hierarchical modeling to interpret the distribution of these ratings without inflating confidence by ignoring repeated measures from the same rater or the same conversation. Qualitative analysis of open-ended comments from participants was also conducted.
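The paper does not publish its statistical code; the sketch below shows one generic way to express such a hierarchical model in PyMC, with random effects for raters and conversations so that repeated measures do not inflate confidence. All data here are toy values:

```python
import numpy as np
import pymc as pm

ratings = np.array([6.0, 5.0, 7.0, 4.0, 6.0, 5.0])  # 7-point Likert responses
rater_idx = np.array([0, 0, 1, 1, 2, 2])            # which rater produced each rating
conv_idx = np.array([0, 1, 0, 1, 0, 1])             # which conversation was rated

with pm.Model():
    mu = pm.Normal("mu", 4.0, 2.0)                  # grand mean on the 1-7 scale
    sigma_rater = pm.HalfNormal("sigma_rater", 1.0)
    sigma_conv = pm.HalfNormal("sigma_conv", 1.0)
    rater = pm.Normal("rater", 0.0, sigma_rater, shape=3)  # per-rater offsets
    conv = pm.Normal("conv", 0.0, sigma_conv, shape=2)     # per-conversation offsets
    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("obs", mu + rater[rater_idx] + conv[conv_idx], sigma, observed=ratings)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)
```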
4. Results and Insights
The authors detail comparisons of LearnLM (trained from the gemini-1.5-pro-002 checkpoint dated 2024-09-24) against three other “flagship” LLMs available as of 2024-10-01:
- GPT-4o (version 2024-08-06; see OpenAI Documentation)
- Claude 3.5 Sonnet (version 2024-06-20; see Anthropic’s Docs)
- Gemini 1.5 Pro (the base model from which LearnLM is derived)
4.1. Comparative Preferences
- LearnLM vs. GPT-4o: This pairing showed the largest preference margin, a 31% overall advantage for LearnLM, with strong edges in “better pedagogy,” “closer to a good human tutor,” “better instruction following,” “better adaptation,” and “better support for the learning goal.”
- LearnLM vs. Claude 3.5 Sonnet: Expert pedagogy raters showed an 11% margin favoring LearnLM. The difference was smaller than with GPT-4o but still statistically credible.
- LearnLM vs. Gemini 1.5 Pro: A 13% margin in favor of LearnLM demonstrates that the specialized training data and RLHF focusing on pedagogy do indeed measurably improve the base model’s tutoring style.
In absolute terms, GPT-4o and Claude 3.5 also performed well, receiving fairly positive ratings on many items. Yet, LearnLM was consistently singled out for aspects like encouraging active learning, providing robust feedback, and preventing the conversation from drifting off-topic.
4.2. Per-Dimension Rubric Scores
Participants rated each system on a 7-point Likert scale across the following categories:
- Manages cognitive load: LearnLM topped the chart, indicating more concise chunking, better logic ordering, and fewer extraneous tangents or contradictions.
- Inspires active learning: LearnLM again scored best, credited for weaving in open-ended prompts and minimal direct answers. Claude 3.5 also performed well, while GPT-4o sometimes provided immediate solutions.
- Deepens metacognition: LearnLM excelled at giving constructive feedback, acknowledging correctness, and guiding the discovery of mistakes.
- Stimulates curiosity: LearnLM was consistently praised for inciting further interest, as measured by “curiosity stimulation” and “encouraging feedback.”
- Adapts to learner: Another advantage for LearnLM; it was more likely to detect frustration or confusion from the role-playing user and pivot its approach accordingly.
4.3. Subjective User Impressions
When the role-playing participants (from the conversation collection phase) reflected on their experiences, they reported a strong willingness to continue learning with LearnLM, underscored by both enjoyment and perceived competence. GPT-4o occasionally seemed “less warm,” leading to lower reported willingness to keep using it as a tutor.
In a deeper qualitative breakdown of preference explanations, the authors highlight emergent themes. For instance, participants who favored LearnLM said it “challenges the learner,” “keeps on topic,” and “does not just give away answers.” Participants who preferred other models sometimes pointed to “clarity” or “information quantity” as reasons, implying LearnLM might occasionally hold back or break the content into smaller pieces that some users found too incremental.
4.4. Safety and Responsibility
LearnLM inherits the same safety policies as Gemini, enforced at the system-instruction layer and at the RLHF stage. The authors reference the Gemini 1.5 technical report for a full model card summarizing data collection procedures, potential pitfalls, compliance measures, and constraints. Key disclaimers remain: despite improvements, system instructions that are unsound or malicious can produce undesirable behaviors, so content developers must adhere to safe usage guidelines.
5. Future Directions
The authors see LearnLM as a stepping stone. By design, LearnLM is an experimental model available on Google AI Studio for real-world feedback. Already, certain pedagogical enhancements discovered via LearnLM have begun merging into the next wave of Gemini models, including Gemini 2.0 (Pichai et al., 2024). Their broader roadmap includes:
- Refining the Pedagogical Rubric: They emphasize that their taxonomy, though grounded in recognized learning science principles (Kirschner and Hendrick, 2020), might not fully capture all cultural or contextual nuances. Building a more universal, community-endorsed pedagogy evaluation standard remains an ongoing challenge.
- Extrinsic Evaluation of Learning Outcomes: While the rating system measures how well the AI exhibits “tutor-like” qualities, the authors want to see if learners truly achieve better educational performance. They note that some pioneering studies show large language models can boost test prep performance or deliver successful short-term interventions. However, rigorous, longitudinal, extrinsic outcomes in real classrooms remain an important frontier.
- Expansion to Non-Academic and Specialized Domains: The team began a feasibility study applying LearnLM’s approach to “medical education” scenarios (Appendix C in the paper). Preliminary evidence suggests that medical students find LearnLM more helpful and more enjoyable than the baseline Gemini 1.5 Pro. However, specialized fields like medicine demand thorough scrutiny for factual accuracy, potential biases, or harmful omissions. The same logic extends to finance, law, vocational training, and beyond.
All told, the authors underscore their aspiration that specifying desired tutoring styles in system instructions should feel natural to domain experts and teachers, lowering the barrier to adopting AI tutoring solutions in myriad contexts. They also highlight that, from a technical perspective, co-training ensures synergy between general large language modeling progress and specialized pedagogical fine-tuning.
6. Case Study: Feasibility in Medical Education
One particularly illustrative example the paper provides appears in Appendix C, describing how LearnLM was tested with 50 unique scenarios covering subtopics from pediatrics to clinical diagnostics. The authors recruited 18 medical students, half in preclinical and half in clinical phases. They discovered that LearnLM garnered positive feedback relative to Gemini 1.5 Pro in terms of clarity, enjoyability, personal goal achievement, and overall experience—though the margin was largest on “enjoyability.”
They caution that because these are emergent technologies, real medical education usage demands an additional layer of domain-specific vetting for accuracy, potential harm, or biases. Tools that misdiagnose or underemphasize critical disclaimers can cause confusion or unsafe medical practice. But the pilot results point to the viability of systematic scenario-based tutoring in advanced domains like medicine, so long as robust oversight is in place.
7. Conclusion
In sum, “LearnLM: Improving Gemini for Learning” illuminates how focusing on pedagogical instruction following transforms a large, multimodal language model (Gemini) into a more dynamic, context-sensitive, and truly educational tutor. The core design principle is that by seeding conversations with system instructions that detail the desired teaching style, the model can seamlessly shift from typical open-domain chat to active tutoring strategies—posing more questions, encouraging the learner to show their work, and withholding direct answers until the student attempts problem-solving.
From a modeling perspective, the synergy of SFT + RLHF with systematically prepared “pedagogical system instructions” stands out as a pragmatic solution, especially when performed via co-training so that the refined model remains up-to-date with Gemini’s ongoing leaps in other capabilities (multimodal context, extended context window, chain-of-thought). From an evaluation standpoint, scenario-based rating fosters a repeatable, multi-turn environment that captures the complexities of actual teaching dialogues.
Quantitatively, the robust preference for LearnLM over GPT-4o, Claude 3.5, and even the Gemini 1.5 Pro baseline underscores the tangible payoff from the authors’ systematic approach. Qualitatively, the mix of textual analysis and user feedback suggests that LearnLM better avoids simply delivering solutions and instead spurs deep learner engagement, while also creating a more supportive environment. Yet, they openly acknowledge that more work remains: building community consensus on a universal pedagogy framework, exploring real-world learning outcomes (beyond subjective measures), and investigating specialized domains thoroughly.
Ultimately, the paper’s authors encourage developers, educators, and researchers to experiment with LearnLM (available on Google AI Studio), share feedback, and co-construct new forms of digital tutoring that can shape next-generation academic environments. They foresee a future in which specifying “the perfect tutor” is as simple as providing a multi-paragraph system instruction block, prompting the model to deliver precisely the style of teaching needed—nuanced, flexible, and empowering for learners.
References and Sources
Below are the references listed in the LearnLM paper, with relevant links reproduced verbatim or, where not originally hyperlinked, provided in the text:
- I. Jurenka et al. (2024). “Towards responsible development of generative AI for education: An evaluation-driven approach.” arXiv preprint arXiv:2407.12687.
- D. M. Ziegler et al. (2019). “Fine-tuning language models from human preferences.” arXiv preprint arXiv:1909.08593.
- Gemini Team et al. (2024). “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.” arXiv preprint arXiv:2403.05530.
- S. Pichai, D. Hassabis, and K. Kavukcuoglu (2024). “Introducing Gemini 2.0: Our new AI model for the agentic era.” https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/
- E. Mollick and L. Mollick (2023). “Assigning AI: Seven approaches for students, with prompts.” arXiv preprint arXiv:2306.10052.
- B. Wen et al. (2024). “Benchmarking complex instruction-following with multiple constraints composition.” arXiv preprint arXiv:2407.03978.
- J. Zhou et al. (2023). “Instruction-following evaluation for large language models.” arXiv preprint arXiv:2311.07911.
- Y. Qin et al. (2024). “InfoBench: Evaluating instruction following ability in large language models.” arXiv preprint arXiv:2401.03601.
- L. Ibrahim et al. (2024). “Beyond static AI evaluations: Advancing human interaction evaluations for LLM harms and risks.” arXiv preprint arXiv:2405.10632.
- W.-L. Chiang, T. Li, and A. Angelopoulos (2024). “Does style matter? Disentangling style and substance in chatbot arena.” https://blog.lmarena.ai/blog/2024/style-control/
- Gemini Team et al. (2023). “Gemini: A family of highly capable multimodal models.” arXiv preprint arXiv:2312.11805.
- R. E. Wang et al. (2024). “Tutor CoPilot: A human-AI approach for scaling real-time expertise.” arXiv preprint arXiv:2410.03017.
- H. Bastani et al. (2024). “Generative AI can harm learning.” SSRN 4895486.
- P. A. Kirschner and C. Hendrick (2020). “How Learning Happens: Seminal Works in Educational Psychology and What They Mean in Practice.” Routledge.
- National Education Association (NEA). “Teaching and learning in the age of artificial intelligence.” https://www.nea.org/resource-library/artificial-intelligence-education/iv-teaching-and-learning-age-artificial-intelligence