April 2, 2026. Google releases Gemma 4. Within 24 hours thousands of community tests are running, benchmarks get torn apart, and Reddit debates whether open-source AI just reached its turning point. Short answer: almost.
What Gemma 4 Is
Four models, two architectures — from Android phones to workstation GPUs:
| Model | Active Parameters | Context | Note |
|---|---|---|---|
| E2B | 2.3B | 128K | Runs on an i7 laptop, incl. audio |
| E4B | 4.5B | 128K | Edge device, native audio input |
| 26B MoE | 3.8B active / 26B total | 256K | 128 experts, only 3.8B per token |
| 31B Dense | 30.7B | 256K | Flagship, #3 worldwide |
All four models process text, images, and video natively. The smaller two additionally handle audio, which the larger ones cannot: an unusual design decision.
License: Apache 2.0. More on that in a moment.
The Numbers You Need to Know
Gemma 3 was a solid model, but it led in no category. Gemma 4 is a different story:
| Benchmark | Gemma 3 27B | Gemma 4 31B | Jump |
|---|---|---|---|
| AIME 2026 (Math) | 20.8% | 89.2% | +68 points |
| GPQA Diamond (Reasoning) | 42.4% | 84.3% | +42 points |
| LiveCodeBench v6 (Coding) | 29.1% | 80.0% | +51 points |
| Codeforces ELO | 110 | 2,150 | +2,040 points |
| MMMLU (Multilingual) | 70.7% | 88.4% | +18 points |
These are not incremental improvements. On the math olympiad benchmark (AIME), the score more than quadrupled. On competitive-programming Elo, Gemma 4 jumped from “barely functional” to expert level.
Technically, the more interesting number belongs to the 26B MoE model: it reaches 97% of the 31B flagship’s quality with only 3.8B active parameters per inference step. Roughly 8x less compute for 3% less quality.
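Under the common simplifying assumption that per-token FLOPs scale roughly linearly with active parameters, the compute comparison is plain arithmetic (router overhead and attention costs ignored):

```python
def compute_ratio(dense_params_b: float, active_params_b: float) -> float:
    """Rough per-token compute advantage of an MoE model over a dense one,
    assuming FLOPs scale ~linearly with active parameters. This is a
    simplification: routing overhead and attention costs are ignored."""
    return dense_params_b / active_params_b

# Parameter counts from the table above (billions)
print(round(compute_ratio(30.7, 3.8), 1))  # → 8.1
```

That ratio is where the “8x less compute” figure comes from; real-world throughput depends heavily on how well the expert routing is implemented (see the community findings below).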
The Real News: Apache 2.0
Benchmark jumps come and go. What really counts in Gemma 4 is in the license header.
Gemma 3 ran under a custom Google license. Technically “open,” but with clauses giving Google the right to terminate access — and to change terms unilaterally. Enterprise legal teams saw this and often blocked Gemma 3. Too much uncertainty for a product built on top of the model.
Alibaba’s Qwen and Mistral’s models have been under Apache 2.0 for longer, and they won enterprise deals that Gemma lost because of it.
Gemma 4 under Apache 2.0 changes that. License terms are frozen with the release — Google can no longer change them retroactively. No more legal review needed, no risk assessment. For companies running AI products on-premise or in regulated environments, this is the decisive difference.
Clément Delangue, CEO of Hugging Face, called it a “huge milestone”, praise he rarely hands out.
Can Gemma 4 Compete with Claude or GPT-5?
Directly: partially — and that’s exactly what’s interesting.
Gemma 4 31B sits at an LMArena score of ~1,452, comparable to GPT-5-mini and above GPT-OSS-120B. In coding (Codeforces Elo 2,150) it beats virtually every open-source model.
It doesn’t catch Claude Opus 4.6 or GPT-5. Those are frontier models with significantly more parameters, massively more training compute, and years of optimization for complex reasoning chains. Gemma 4 doesn’t change that.
The realistic comparison: Gemma 4 31B is a direct competitor to Claude Haiku or GPT-4o mini, at zero marginal cost per query, deployable on-premise, without data-privacy concerns, without rate limits.
Anyone currently using Claude Haiku for internal tools or batch processing and feeling the API costs: Gemma 4 is a concrete alternative that makes financial sense.
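Whether the switch makes financial sense is a one-line break-even calculation. A minimal sketch; both inputs are placeholders, not real Anthropic or hosting prices, so plug in your own numbers:

```python
def breakeven_queries_per_month(api_cost_per_query: float,
                                server_cost_per_month: float) -> float:
    """Monthly query volume at which a self-hosted server matches API spend.
    Both arguments are placeholders: use your actual per-query API cost
    (input + output tokens) and your actual GPU server cost."""
    return server_cost_per_month / api_cost_per_query

# Hypothetical numbers: $0.002 per Haiku-class query, $400/month GPU server
print(round(breakeven_queries_per_month(0.002, 400)))  # → 200000
```

Above the break-even volume, self-hosting wins on cost alone; below it, the API is cheaper and the argument for Gemma 4 rests on privacy and on-premise requirements instead.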
What the Community Found After 24 Hours
The benchmarks check out. But practice also showed some rough edges.
Inference speed with MoE: Multiple users report 11 tokens/second for the 26B MoE model on an RTX 5060 Ti — compared to 60+ tokens/second for Qwen 3.5 35B on the same card. The MoE design with 128 small experts is apparently not yet optimally tuned for current consumer hardware.
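Translated into wall-clock time, the reported throughput gap is significant. A back-of-envelope sketch using the community numbers above:

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to stream a reply at a given decode rate."""
    return tokens / tokens_per_second

# For a 1,000-token reply, using the reported community numbers:
print(round(generation_seconds(1000, 11)))  # Gemma 4 26B MoE → 91 s
print(round(generation_seconds(1000, 60)))  # Qwen 3.5 35B    → 17 s
```

A minute and a half versus seventeen seconds for the same reply length is the difference between usable and frustrating in an interactive tool, until inference kernels catch up with the 128-expert design.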
VRAM consumption: Gemma 4 needs more VRAM at long contexts than Qwen 3.5 at the same quantization. Using the 256K context window in practice requires more hardware than the parameter count suggests.
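Most of that long-context VRAM pressure is KV cache, which grows linearly with context length. A rough estimator; the layer count, KV-head count, and head dimension below are illustrative placeholders, not Gemma 4’s published architecture:

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """Estimate KV-cache size: two tensors (K and V) per layer,
    fp16 (2 bytes/element) by default."""
    total_bytes = 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 2**30

# Illustrative config — NOT Gemma 4's actual architecture:
print(round(kv_cache_gib(48, 8, 128, 256_000), 1))  # → 46.9 GiB at fp16
```

Even with generous grouped-query attention, filling a 256K window costs tens of gigabytes of cache on top of the weights, which is why the practical hardware bill exceeds what the parameter count suggests.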
Day-0 fine-tuning: Hugging Face Transformers didn’t recognize the new gemma4 architecture (workaround: install from source). PEFT had issues with new layer types in the vision encoder. Issues and PRs came within hours — but anyone wanting to fine-tune immediately had work to do.
Multilingual: Surprisingly strong. Tests in German, Arabic, and Vietnamese show better results than Qwen 3.5. A real advantage for global deployments.
What’s still missing: QAT versions (quantization-aware training) — these came weeks after Gemma 3’s release and significantly improve quantized models. Also no 9-12B dense model, leaving a gap for users with mid-range GPUs.
What This Means for You
If you’re planning or operating any of the following today:
- Internal AI tools with data privacy requirements: Gemma 4 E4B or 31B locally on a single server.
- Agentic workflows (function calling, structured JSON): Natively supported, no prompting workarounds needed.
- Multilingual applications: Strongest open-source option in the space, better than Qwen 3.5 for European languages.
- Edge / mobile AI: E2B runs on an i7 laptop with 32 GB RAM at usable latency.
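For the agentic, structured-JSON case, a common pattern is to serve the model behind a local OpenAI-compatible endpoint (vLLM and Ollama both provide one) and request JSON output. A minimal sketch of the request body; the model id is a placeholder, not an official checkpoint name:

```python
import json

def build_extraction_request(text: str) -> dict:
    """Request body for a local OpenAI-compatible chat endpoint.
    'gemma-4-31b' is a placeholder model id, not an official name."""
    return {
        "model": "gemma-4-31b",
        "messages": [
            {"role": "system",
             "content": "Extract invoice number and total as JSON."},
            {"role": "user", "content": text},
        ],
        "response_format": {"type": "json_object"},  # force valid JSON output
        "temperature": 0,
    }

body = json.dumps(build_extraction_request("Invoice #42, total 99.00 EUR"))
print(json.loads(body)["response_format"]["type"])  # → json_object
```

POST this to your server’s `/v1/chat/completions` route and parse the reply with `json.loads`; no prompt-engineering workarounds required.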
For coding assistants on your own server, Gemma 4 31B is production-ready today — with limitations in fine-tuning tooling that will stabilize over the coming weeks.
Conclusion
Gemma 4 is the strongest open-source model Google has ever released. The benchmark jumps from Gemma 3 to 4 are the largest we’ve seen in a single generation in the open-source space.
It doesn’t replace frontier models like Claude Opus or GPT-5. For local deployments, batch processing, internal tools, and on-premise requirements, it’s the first choice going forward — not because there are no alternatives, but because the combination of quality, Apache 2.0 license, and hardware efficiency hasn’t existed like this in open source before.
The real shift isn’t technical. It’s that Google has for the first time released a model with the same rights as any other open-source code on your server.
Planning to run AI models in your infrastructure? We help with evaluation, deployment, and integration — on-premise or in the cloud. Get in touch.