Friday, April 3, 2026
Kingy AI

Gemma 4 Is Here: Google’s Most Powerful Open-Weight Model Family Yet

By Curtis Pyke
April 2, 2026

Google DeepMind just dropped its most ambitious open-weight release to date — and the implications for local AI, agentic workflows, and on-device deployment are significant.


On April 2, 2026, Google DeepMind officially announced Gemma 4, the newest generation of its open-weight model family. The release notes date the model weights themselves to March 31, 2026, but the public splash landed on April 2. Either way, the AI community has had very little time to absorb just how much has changed. This is not a minor point release.

Gemma 4 represents a fundamental rethinking of what an open-weight model family can look like: multimodal from the ground up, built for on-device and edge deployment, redesigned architecturally, and — critically — relicensed under Apache 2.0, a move that materially changes what enterprises and developers can do with these models.

Let’s go deep.


What Gemma 4 Actually Is

The short version: Gemma 4 is Google DeepMind’s new family of open-weight models built for reasoning, coding, agentic workflows, and multimodal understanding. Google calls them its “most intelligent open models to date.” But the longer and more interesting version is about the architecture of the release itself — the choices Google made about what sizes to build, how to handle multimodality, what license to ship under, and where it’s aiming these models.

Previous Gemma releases targeted developers and researchers who mostly wanted a capable open-weight text model they could fine-tune or run locally. Gemma 4 is more ambitious. Google is positioning this family as a full-spectrum deployment stack — from smartphones and edge devices all the way up to workstations and GPU servers. That’s a meaningful shift. It means the product thinking here is not just about benchmark scores but about where AI actually gets used, and Google clearly believes that more and more of that compute will happen off the cloud and close to the user.

To support that vision, Gemma 4 ships in four distinct model sizes, includes multimodal capabilities across the entire family (with audio support on the smaller models), supports 140+ pre-training languages, and integrates with an enormous ecosystem of inference frameworks from day one. It’s a complete release strategy, not just a model drop.


The Four Models: A Detailed Breakdown

Gemma 4 ships in four sizes, and understanding the naming convention is essential because the naming encodes meaningful architectural information.

Gemma 4 E2B

The E2B is the smallest model in the family. “E” stands for effective parameters — a naming choice Google made specifically to highlight that these models use Per-Layer Embeddings (PLE) to be more efficient than their raw parameter count might suggest. The E2B has 2.3 billion effective parameters, but 5.1 billion parameters total when embeddings are included. It supports a 128K token context window and handles text, image, and audio inputs. Memory requirements are approximately 9.6 GB in BF16 precision, 4.6 GB in SFP8, and 3.2 GB in Q4_0 quantization. At Q4_0, this model can run on a device with 4–5 GB of VRAM — including many consumer laptops and, potentially, capable smartphones.

Gemma 4 E4B

The E4B steps up to 4.5 billion effective parameters, 8 billion total with embeddings. It shares the same 128K context window as the E2B and also supports text, image, and audio. Memory requirements climb to 15 GB BF16, 7.5 GB SFP8, or 5 GB at Q4_0. This is still firmly in the “runs on a laptop” category when quantized, and Google’s benchmark numbers for the E4B are notably strong for its size — more on those later. Like the E2B, the “E” naming acknowledges that these models use Per-Layer Embeddings to squeeze more capability out of fewer active parameters, which is the key design philosophy for edge deployment.

Gemma 4 31B Dense

This is the flagship dense model. At 30.7 billion parameters with a 256K token context window, it supports text and image inputs (but not audio, unlike the smaller models). Memory requirements are substantial: 58.3 GB in BF16, 30.4 GB in SFP8, and 17.4 GB at Q4_0. Unquantized in BF16, the 31B fits on a single NVIDIA H100 80GB GPU. In quantized form, it can run on consumer-grade multi-GPU setups or high-end workstations. Google claims the 31B is currently ranked third among all open models on the Chatbot Arena leaderboard — an independent crowd-sourced ranking platform that has become one of the most trusted measures of real-world model quality. That’s a significant claim.

Gemma 4 26B A4B (Mixture-of-Experts)

This is arguably the most technically interesting model in the family. The “A” in A4B stands for active parameters — this is a Mixture-of-Experts (MoE) architecture, meaning that at inference time, only a subset of the total parameters are activated per token. Specifically, the 26B A4B has 25.2 billion total parameters, but only 3.8 billion are active during any given forward pass. Google achieves this through a routing mechanism that uses 8 active experts out of 128 total, plus 1 shared expert.

The result: inference cost that’s closer to a 4B model even though the model has far more total knowledge encoded in its weights. It supports a 256K context window, handles text and image, and requires approximately 48 GB in BF16, 25 GB in SFP8, or 15.6 GB at Q4_0. Google reports that this model currently sits at number 6 on the Arena leaderboard among open models, which means you have a model with roughly 4B-class inference speed competing with models that cost far more to run.
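A minimal sketch shows how top-k routing keeps the per-token cost low. Every dimension and the gating details here are illustrative assumptions, not Gemma 4’s actual implementation; only the 8-of-128-plus-shared shape comes from the announcement.

```python
import numpy as np

def moe_route(x, gate_w, experts, shared, k=8):
    # For each token, score all experts, keep the top-k, mix their
    # outputs by softmax weights, and always add one shared expert.
    logits = x @ gate_w                            # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]     # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                               # softmax over the selected k only
        for j, e in enumerate(topk[t]):
            out[t] += w[j] * experts[e](x[t])
    return out + shared(x)                         # shared expert is always active

# Toy demo: 128 tiny linear "experts", 4 tokens of width 16.
rng = np.random.default_rng(0)
d, n_experts = 16, 128
gate_w = rng.normal(size=(d, n_experts))
expert_ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_experts)]
experts = [lambda v, W=W: v @ W for W in expert_ws]
shared_w = rng.normal(scale=0.1, size=(d, d))
shared = lambda v: v @ shared_w
x = rng.normal(size=(4, d))
y = moe_route(x, gate_w, experts, shared, k=8)
print(y.shape)  # (4, 16)
```

Only 8 of the 128 expert matrices contribute to each token’s output, which is why the per-token compute cost tracks the active parameter count rather than the total.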


What’s New: The Feature List That Matters

Google’s documentation highlights several major upgrades over previous Gemma generations, and it’s worth going through each one carefully rather than treating them as a bullet list.

Configurable Thinking Modes

Gemma 4 instruction-tuned models include what Google calls “configurable thinking modes” — essentially, the ability to toggle step-by-step chain-of-thought reasoning on or off depending on whether you need deep analytical work or fast responses. This is similar to the thinking mode toggle in models like Claude and Gemini, and it matters because it gives developers fine-grained control over latency versus reasoning depth. For agentic workflows where some steps need careful reasoning and others just need a fast lookup or generation, this is a meaningful capability.

Function Calling and Structured Tool Use

Native function calling support is now built into the Gemma 4 family. This is a requirement for building reliable agentic systems — models that can call external APIs, query databases, use search, or interact with multi-step tool chains. The fact that this is built into a fully open-weight model family with an Apache 2.0 license is genuinely new: it means developers can build tool-use agentic systems without any dependency on proprietary model APIs, which has significant implications for privacy-sensitive applications and enterprises that cannot send data to external services.
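In practice, an agent loop parses a tool call emitted by the model and dispatches it to local code. The JSON shape, tool name, and arguments below are hypothetical; Gemma 4’s real wire format is defined by its chat template.

```python
import json

# Hypothetical example: the model emits a function call as a JSON object.
# The schema, tool name, and arguments here are illustrative only.
model_output = '{"name": "get_weather", "arguments": {"city": "Lagos"}}'

TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",  # stand-in for a real API call
}

def dispatch(raw: str) -> str:
    # Parse the model's tool call and run the matching local function,
    # the core loop of an agentic tool-use system.
    call = json.loads(raw)
    return TOOLS[call["name"]](**call["arguments"])

print(dispatch(model_output))  # Sunny in Lagos
```

Because the model weights run locally, the entire loop, including the tool results fed back into the next turn, stays on your own infrastructure.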

Native System-Role Support

Previous Gemma versions had inconsistent or indirect handling of the system prompt / system role in the conversation structure. Gemma 4 adds native system-role support, which should make instruction control cleaner, more reliable, and more consistent across deployment frameworks. For anyone who has struggled with Gemma models drifting from system instructions or treating system prompts inconsistently, this is a welcome fix.

Extended Context Windows

The context window jump is real and substantial. Gemma 4 E2B and E4B support 128K tokens — up from 8K in earlier Gemma releases. The 31B and 26B A4B push to 256K tokens, which puts them in the same league as leading proprietary context models. To support long-context inference efficiently, Google uses Proportional RoPE (p-RoPE) on global attention layers combined with unified Key/Value caching, which reduces the memory overhead that normally makes very long context windows prohibitively expensive.
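To see the scale of the problem that unified KV caching addresses, a back-of-envelope cache estimate helps. The layer and head counts below are invented for illustration; Google has not published Gemma 4’s internal dimensions at this level.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_val=2):
    # Two cached tensors per layer (K and V), each n_kv_heads * head_dim
    # values per token, at 2 bytes each in BF16.
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return values * bytes_per_val / 1e9

# Hypothetical mid-size configuration at the full 256K context:
print(round(kv_cache_gb(n_layers=48, n_kv_heads=8, head_dim=128, seq_len=256_000), 1))
# -> 50.3 (GB) of cache alone, before any sharing or windowing tricks
```

Numbers like this are why naive full-attention caching makes very long contexts prohibitively expensive, and why techniques that shrink or share the cache matter so much.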

Multimodality: Text, Image, and Audio

Gemma 4 is multimodal across the entire family. The 31B and 26B A4B handle text and images. The E2B and E4B go further: they also handle audio input. Audio capabilities include automatic speech recognition (ASR) and speech-to-translated-text translation across multiple languages.

On the vision side, Google says Gemma 4 supports variable aspect ratios and resolutions, and the vision tasks supported include optical character recognition (OCR), chart and graph understanding, document and PDF parsing, handwriting recognition, and screen and UI understanding. Video is handled by processing sequences of frames, meaning video understanding is architecturally possible even though it’s not a dedicated modality.

Multilingual Support

Gemma 4 was pre-trained on 140+ languages, with 35+ languages actively supported for instruction-following and fine-tuning. This makes it a genuinely global model, not an English-first model with occasional multilingual capability bolted on. For applications serving non-English speaking users — or for anyone building multilingual agents — this is a major upgrade.


Architecture: What’s Under the Hood

Google has published enough detail about Gemma 4’s architecture to understand the key technical decisions, and several of them are worth examining.

Hybrid Attention Design

Gemma 4 uses a hybrid attention architecture that alternates between local sliding-window attention and global full-context attention layers. Local sliding-window attention is computationally efficient but limits how far back in the sequence a given token can “see.” Global attention is the standard full self-attention mechanism that lets every token attend to every other token.

By alternating between the two, Gemma 4 gets most of its efficiency from the local layers while retaining full-context awareness at the global layers. Critically, Google specifies that the last layer is always global — ensuring that the final representation of a sequence has access to the full context.

This hybrid design is similar in spirit to architectures explored in models like Mistral’s sliding window attention and Google’s own research, but the specific combination here — with Proportional RoPE on the global layers — is tuned for the very long context windows Gemma 4 is targeting.
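The alternating pattern can be sketched with plain attention masks. The even/odd layout and window size below are illustrative assumptions; the one property taken directly from Google’s description is that the final layer is always global.

```python
import numpy as np

def layer_masks(n_layers, seq_len, window):
    # Build per-layer causal attention masks for a hybrid stack:
    # alternate sliding-window and global layers, forcing the final
    # layer to be global. True means "query i may attend to key j".
    i = np.arange(seq_len)
    causal = i[None, :] <= i[:, None]                    # j <= i (causal)
    local = causal & (i[:, None] - i[None, :] < window)  # and within the window
    masks = []
    for layer in range(n_layers):
        is_global = (layer % 2 == 1) or (layer == n_layers - 1)
        masks.append(causal if is_global else local)
    return masks

masks = layer_masks(n_layers=6, seq_len=8, window=3)
print(bool(masks[0][7, 3]))   # False: a local layer can't see 4 tokens back
print(bool(masks[-1][7, 0]))  # True: the last layer is always global
```

The local layers keep attention cost linear in the window size, while the interleaved global layers (and the guaranteed-global final layer) preserve full-context reach.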

Per-Layer Embeddings (PLE) in the E-Models

The E2B and E4B models introduce Per-Layer Embeddings, which Google uses to improve parameter efficiency for on-device deployment. Rather than sharing a single embedding table across the entire model depth, PLE allows different layers to use distinct embedding representations. This increases the model’s expressive capacity without proportionally increasing the parameter count that matters for inference cost, which is why Google brands these as “effective parameter” models. The 2.3B effective parameters in the E2B, operating within a 5.1B total parameter model, reflect this design.
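The “effective” naming can be sanity-checked with simple arithmetic on the published figures. How Google actually partitions the weights is not specified, so this is back-of-envelope accounting only.

```python
# Published sizes for the E-models: (effective params, total params), in billions.
published = {"E2B": (2.3, 5.1), "E4B": (4.5, 8.0)}

for name, (effective, total) in published.items():
    # The difference is the parameter budget attributable to embeddings
    # under this naming scheme; the internal split is not published.
    embed = round(total - effective, 1)
    print(f"{name}: {embed}B params outside the 'effective' count")
```

On these figures, more than half of the E2B’s total parameters sit outside the “effective” count, which is the whole point of the branding: inference cost scales with the smaller number.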

Mixture-of-Experts in the 26B A4B

The 26B A4B’s MoE design deserves particular attention. With 128 total experts and 8 active per forward pass — plus 1 shared expert that’s always active — the routing mechanism must efficiently select which experts to activate for each token. Google says the active parameter count is approximately 3.8 billion, meaning the effective inference cost per token is dramatically lower than the 25.2B total parameters would suggest.

MoE models have historically been difficult to run efficiently at inference time because sparse routing still requires all expert weights to be resident in memory, even though most are inactive for any given token. Google’s hardware guidance — which lists the 26B A4B at 48 GB in BF16, only slightly below the 31B’s 58.3 GB — reflects exactly that: total weights dominate the memory footprint, which is why quantization makes the biggest practical difference here.


Benchmark Performance: The Numbers in Detail

Google’s benchmark results for the Gemma 4 instruction-tuned models are strong, and in several cases, surprising for the model size.

31B Dense (Instruction-Tuned)

  • MMLU Pro: 85.2%
  • AIME 2026 (no tools): 89.2%
  • LiveCodeBench v6: 80.0%
  • GPQA Diamond: 84.3%
  • MMMU Pro (vision): 76.9%
  • MRCR v2 8-needle 128K (long context): 66.4%

The AIME 2026 score of 89.2% without tools is particularly notable — AIME (American Invitational Mathematics Examination) problems require sophisticated multi-step mathematical reasoning, and this score puts the 31B in elite territory for math reasoning without external tool assistance.

26B A4B MoE (Instruction-Tuned)

  • MMLU Pro: 82.6%
  • AIME 2026 (no tools): 88.3%
  • LiveCodeBench v6: 77.1%
  • GPQA Diamond: 82.3%
  • MMMU Pro (vision): 73.8%
  • MRCR v2 8-needle 128K: 44.1%

The 26B A4B’s results are only modestly behind the 31B on most benchmarks, while activating only about 3.8 billion parameters per forward pass. That ratio — near-31B quality at roughly 4B-class inference cost — is the core value proposition of the MoE design.

E4B (Instruction-Tuned)

  • MMLU Pro: 69.4%
  • LiveCodeBench v6: 52.0%

A 69.4% MMLU Pro score from a model that fits in 5 GB at Q4_0 is legitimately impressive. For comparison, this is competitive with much larger dense models from just 18 months ago.

E2B (Instruction-Tuned)

  • MMLU Pro: 60.0%
  • LiveCodeBench v6: 44.0%

The E2B hits 60% MMLU Pro at roughly 3.2 GB quantized. This is a model that Google is explicitly targeting at phones and browsers. A 60% MMLU Pro score at that footprint is the kind of number that accelerates the timeline on genuinely capable on-device AI.

Google’s headline claim — that Gemma 4 can outperform models 20x larger on the Arena leaderboard — is contextualized by these numbers. The 31B ranking third among open models, ahead of models that are 200B+ parameters, speaks to the quality-per-parameter efficiency improvements in this generation.


Hardware Requirements: What You Actually Need to Run It

Google published approximate inference memory requirements for Gemma 4 across three precision levels, and this information is practically useful for anyone planning local deployment.

Model      BF16       SFP8       Q4_0
E2B        9.6 GB     4.6 GB     3.2 GB
E4B        15 GB      7.5 GB     5 GB
31B        58.3 GB    30.4 GB    17.4 GB
26B A4B    48 GB      25 GB      15.6 GB

The takeaways here are important:

The E2B at Q4_0 fits comfortably in 4 GB of VRAM, which means a recent MacBook Air with 8 GB unified memory can run it with headroom for the OS and other applications. Google’s Android Studio announcement specifically lists 8 GB total RAM as the recommended minimum for running E2B in local agentic coding workflows on Android.

The E4B at Q4_0 fits in 5 GB — still within the range of consumer laptop hardware, and easily within a 12 GB VRAM GPU like an RTX 4070.

The 31B in BF16 fits on a single NVIDIA H100 80GB, and in Q4_0 at 17.4 GB, it fits on a single RTX 4090 (24 GB) with memory to spare — or on an M2/M3 MacBook Pro with 24 GB or more unified memory.

The 26B A4B at Q4_0 requires 15.6 GB, which is also within range of a single RTX 4090 or high-end Apple Silicon Mac.

For context, the Android Studio Gemma 4 developer post recommends 8 GB RAM for E2B, 12 GB for E4B, and 24 GB for the 26B MoE for local agentic coding workflows — a telling set of targets that signals Google is genuinely trying to make the smaller models work on mainstream developer hardware.
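These published figures line up with simple bytes-per-parameter arithmetic, which is a useful sanity check when sizing hardware. The Q4_0 figure below (about 4.5 bits per weight, including quantization scales) is an approximation, not an official number.

```python
def approx_weight_gb(total_params_billions, bytes_per_param):
    # Weights only: excludes KV cache, activations, and runtime overhead.
    return total_params_billions * bytes_per_param

# BF16 = 2 bytes/param; Q4_0 ~ 0.5625 bytes/param (4.5 bits incl. scales).
print(round(approx_weight_gb(5.1, 2.0), 1))      # E2B, BF16 -> 10.2 (table: 9.6 GB)
print(round(approx_weight_gb(30.7, 0.5625), 1))  # 31B, Q4_0 -> 17.3 (table: 17.4 GB)
```

Small discrepancies against the published table are expected, since quantized formats typically keep some tensors (embeddings, norms) at higher precision.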


Where to Download and Try Gemma 4 Right Now

Google has made Gemma 4 immediately accessible through multiple channels.

Hugging Face

The model weights are available on Hugging Face under the Google organization. Both pre-trained base models and instruction-tuned variants are available for all four sizes. Hugging Face’s Transformers library has day-one support, meaning you can load and run Gemma 4 models with the standard from_pretrained pipeline.

Kaggle

Model weights are also available on Kaggle, where Google has historically published Gemma releases alongside Hugging Face. Kaggle’s free GPU tiers make it a good option for experimenting with the larger models without requiring local hardware.

Google AI Studio

The 31B and 26B A4B instruction-tuned models are available to try directly in Google AI Studio — Google’s web-based interface for experimenting with AI models. This is the fastest way to interact with the flagship models without any local setup.

Google AI Edge Gallery

The E4B and E2B models are available in Google AI Edge Gallery, which is Google’s platform for on-device model deployment. This is the intended entry point for developers building mobile and edge applications with the smaller Gemma 4 models.


Ecosystem: Day-One Framework Support

The breadth of day-one ecosystem support for Gemma 4 is worth noting explicitly because it addresses one of the historical pain points of open-weight model releases — the lag between when weights drop and when the inference stack actually supports them well.

Google has confirmed day-one support across the following frameworks and tools:

  • Ollama — the most popular local model runner for macOS, Windows, and Linux users who want one-line model deployment
  • LM Studio — the GUI-based local model runner that’s become a standard tool for non-technical users exploring local AI
  • llama.cpp — the foundational C++ inference library that powers most quantized local model running
  • Gemma.cpp — Google’s own C++ inference library for Gemma models
  • MLX — Apple’s machine learning framework optimized for Apple Silicon, which means E2B and E4B should run efficiently on macOS
  • vLLM — the high-throughput production inference engine widely used for serving LLMs at scale
  • Unsloth — the popular fine-tuning framework that has become the standard for efficient LoRA and QLoRA training
  • Keras — Google’s high-level deep learning API
  • LiteRT-LM and MediaPipe LLM Inference API — Google’s mobile/edge inference runtimes
  • Vertex AI, Cloud Run, and GKE — Google Cloud’s managed AI infrastructure stack

Having all of these on day one means that whether you’re a hobbyist running Ollama on a laptop, a researcher fine-tuning with Unsloth, a startup deploying with vLLM, or an enterprise team using Vertex AI, you can integrate Gemma 4 immediately without waiting for framework updates.


The License Change: Why Apache 2.0 Actually Matters

This deserves its own section because it may be the most consequential change in the entire release for commercial adoption.

Previous Gemma models shipped under a custom Google Gemma Terms of Use that, while relatively permissive, imposed restrictions that made enterprise legal teams uncomfortable — particularly around commercial use at scale, redistribution, and model modification. This license was not OSI-approved and created ambiguity for many organizations that have blanket policies around approved open-source licenses.

Gemma 4 ships under Apache 2.0. This is a well-understood, OSI-approved, commercially permissive license with no restrictions on commercial use, redistribution, or modification beyond attribution requirements. For enterprise teams, this change removes a major friction point. Apache 2.0 is already on most organizations’ pre-approved open-source license lists. It means legal review of Gemma 4 for enterprise deployment will be dramatically faster and simpler than for prior Gemma models.

Combined with the capability improvements in Gemma 4, the license change makes a strong argument for enterprises that have been waiting on the sidelines of local model deployment to take another look. The calculus has changed: you can now deploy a top-3 open model on your own infrastructure, without data leaving your network, under a license that your legal team already approves of.


The Two Biggest Storylines

Stepping back from the specifics, two narratives stand out from this release.

The edge AI push is serious.

Google has clearly invested heavily in making the E2B and E4B models work well on consumer and mobile hardware. The Per-Layer Embeddings design, the audio input support at small sizes, the 128K context window on tiny models, the day-one support in LiteRT-LM and MediaPipe — all of these are signals that Google is building for a world where AI runs on your device, not in a data center. The Android Studio integration is particularly telling: Google is putting Gemma 4 directly into the tools developers use to build Android apps, making local on-device AI a first-class citizen of Android development.

The quality-per-parameter gap between open and closed models is closing fast.

A 31B model ranked third among all open models on the Arena leaderboard, with an 89.2% score on AIME 2026 math problems without tools, is not a model that closed-source players can ignore. The 26B A4B running at 4B-class inference cost while hitting near-31B benchmark quality is an extraordinary value proposition. Google’s claim that Gemma 4 can outperform models 20x its size is no longer just marketing — it’s a benchmark claim that independent researchers will be able to verify within days.


What This Means for Developers and Builders

If you are building anything that involves local AI, on-device AI, agentic systems, multimodal applications, or privacy-sensitive AI deployments, Gemma 4 demands your immediate attention.

For local AI hobbyists and researchers: Download E4B or the 26B A4B from Hugging Face today. Pull them into Ollama or LM Studio. The benchmark numbers suggest you’re going to be surprised.

For fine-tuning practitioners: The Apache 2.0 license and Unsloth day-one support mean you can start building custom fine-tunes immediately with no licensing concerns about the resulting model weights.

For enterprise teams: The Apache 2.0 license change is your signal to move this up the evaluation queue. The 31B on a single H100 with 256K context and native function calling is a serious private deployment option.

For mobile and edge developers: The Google AI Edge Gallery and Android Studio integrations are your on-ramp. E2B at 3.2 GB Q4_0 with audio input and 128K context is not a toy — it’s a production-grade mobile AI model.

For agentic workflow builders: The combination of native function calling, configurable thinking modes, native system-role support, and long context makes Gemma 4 the first fully open-weight model family that has all the standard requirements for production agentic systems — without a proprietary API in the loop.


Final Assessment

Gemma 4 is a genuine step-change release. It is not Google playing catch-up or releasing a benchmark-optimized model that falls apart in practice. The breadth of the release — four model sizes, full multimodality, a rearchitected attention design, 140+ language pre-training, day-one ecosystem support across a dozen frameworks, and an Apache 2.0 license — reflects months of coordinated engineering across DeepMind, Google Cloud, and Google’s developer platforms teams.

The 31B and 26B A4B are the headline performers, and their Arena leaderboard positions will attract the most attention in the first news cycle. But in the longer run, the E2B and E4B may prove more disruptive. A 128K-context multimodal model with audio input, function calling, and a 3.2 GB quantized footprint is exactly what the on-device AI ecosystem has been waiting for — and Google has handed it to the world under Apache 2.0.

The local AI moment just got meaningfully more real.

Curtis Pyke

A.I. enthusiast with multiple certificates and accreditations from Deep Learning AI, Coursera, and more. I am interested in machine learning, LLMs, and all things AI.


© 2024 Kingy AI

Welcome Back!

Login to your account below

Forgotten Password?

Retrieve your password

Please enter your username or email address to reset your password.

Log In
No Result
View All Result
  • AI News
  • Blog
  • Contact

© 2024 Kingy AI

This website uses cookies. By continuing to use this website you are giving consent to cookies being used. Visit our Privacy and Cookie Policy.