A Chip Off the Old Block
Google has a history of playing a high-stakes game of catch-up in the open-source AI space. With the release of Gemma 4, the company isn’t just catching up; it’s attempting to redefine the architecture of what we consider a small model. After spending a week with the new lineup, ranging from the mobile-optimized E2B to the beefier 31B Dense variant, it’s clear that Google DeepMind is prioritizing efficiency and intelligence-per-parameter over raw scale.
The standout feature of Gemma 4 is undoubtedly the new Per-Layer Embedding (PLE) system found in the edge models. In traditional transformers, a token gets one embedding at the start. Gemma 4’s E2B and E4B models inject a small residual signal into every single decoder layer. The result is a 2.3B parameter model that punches significantly above its weight class, often rivaling 7B or 8B models from just a year ago. In my testing, the E4B variant handled complex conditional logic and math problems, like calculating pro-rated savings goals, with a level of nuance that usually requires a much larger VRAM footprint.
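To make the idea concrete, here is a minimal sketch of per-layer embedding injection. The table shapes, the low-rank projection, and the names (`ple_embed`, `ple_proj`) are my own illustrative assumptions, not Gemma 4's actual implementation; the point is simply that a token-conditioned residual is re-injected at every decoder layer rather than only at layer zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only -- not Gemma 4's real hyperparameters.
VOCAB, D_MODEL, D_PLE, N_LAYERS, SEQ = 100, 64, 8, 4, 5

# Standard input embedding table.
tok_embed = rng.normal(size=(VOCAB, D_MODEL)) * 0.02
# Hypothetical per-layer tables: one small, low-dimensional table per layer.
ple_embed = rng.normal(size=(N_LAYERS, VOCAB, D_PLE)) * 0.02
# Projection from the small PLE dimension back up to the model width.
ple_proj = rng.normal(size=(N_LAYERS, D_PLE, D_MODEL)) * 0.02

def decoder_layer(h):
    # Stand-in for attention + MLP; identity keeps the sketch short.
    return h

tokens = rng.integers(0, VOCAB, size=SEQ)
h = tok_embed[tokens]                      # (SEQ, D_MODEL): the usual one-shot embedding
for layer in range(N_LAYERS):
    # PLE: add a token-conditioned residual signal at every layer.
    h = h + ple_embed[layer][tokens] @ ple_proj[layer]
    h = decoder_layer(h)

print(h.shape)  # (5, 64)
```

Because the per-layer tables are low-dimensional, the extra parameters can live in cheap memory and be streamed in per layer, which is presumably how the edge variants keep their accelerator footprint so small.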
Compared to the Qwen 3.5 family (specifically the 7B and 14B models), Gemma 4 feels more composed. While Qwen models are famous for their raw speed and expansive multilingual capabilities, Gemma 4 is more Gemini-aligned: it follows instructions with a clinical precision that reduces the vibe-checking usually required with open-weights models. Its native multimodal support is also a highlight; the workstation models (26B MoE and 31B Dense) handle video inputs up to 60 seconds natively, a feat that usually requires a complex pipeline of separate vision encoders. In agentic workflows, Gemma's cleaner output formatting consistently beat out Qwen, which can sometimes drift into thinking loops that delay the final answer.
The Weight Class Limitation
However, for all its architectural brilliance, Google is still playing it safe where it matters most to the pro community: scale. The Gemma 4 lineup effectively tops out at 31B parameters. While the 26B Mixture-of-Experts (MoE) model is an engineering marvel, offering the intelligence of a mid-sized model at the inference cost of a 4B model, it leaves a massive vacuum at the high end.
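The "mid-sized intelligence at 4B inference cost" claim comes down to sparse routing: a Mixture-of-Experts layer stores many expert MLPs but executes only a few per token. The sketch below is a generic top-k MoE router in NumPy; the sizes, names, and the top-2 choice are illustrative assumptions, not Google's disclosed configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

D, N_EXPERTS, TOP_K = 32, 16, 2   # illustrative sizes, not Gemma 4's real config

# Each expert is a small weight matrix here; only TOP_K of them run per token.
experts_w = rng.normal(size=(N_EXPERTS, D, D)) * 0.05
router_w = rng.normal(size=(D, N_EXPERTS)) * 0.05

def moe_layer(x):
    # The router scores every expert, but we only execute the top-k.
    scores = x @ router_w                      # (N_EXPERTS,)
    top = np.argsort(scores)[-TOP_K:]          # indices of the k highest-scoring experts
    gates = np.exp(scores[top] - scores[top].max())
    gates /= gates.sum()                       # softmax over the selected experts only
    return sum(g * (x @ experts_w[i]) for g, i in zip(gates, top))

x = rng.normal(size=D)
y = moe_layer(x)
print(y.shape)  # (32,)
```

Total parameter count scales with `N_EXPERTS`, but per-token compute scales with `TOP_K`, which is why a 26B-total model can run at roughly the speed of a much smaller dense one.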
If you are looking for a true frontier-class open model to rival GPT-4o, Claude 3.5 Sonnet, or, indeed, Gemini 3.1, Gemma 4 isn't trying to be that. Alibaba's Qwen 3.5 235B (MoE) and other massive open-weights models like DeepSeek still hold the crown for the most complex reasoning and deep subject-matter expertise. By capping Gemma at 31B, Google is signaling that it views open weights as a tool for the edge and the workstation, while keeping the truly massive frontier intelligence behind the Gemini API curtain.
Gemma 4 is the undisputed king of the 12GB VRAM tier, but if you have an H100 cluster and need 400B-parameter reasoning, you'll have to look elsewhere.
Furthermore, while the 128K to 256K context window is generous and holds up well against context rot, the KV cache requirements for Gemma 4 are notably higher than previous versions. On a consumer GPU like the RTX 5090, you might find yourself hitting VRAM ceilings much faster than expected when pushing that 256K limit. Ultimately, Gemma 4 is a triumph of optimization and a masterclass in making small feel big, but its refusal to compete in the heavyweight division keeps it as a specialized tool rather than a total replacement for the giants of the field.
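The VRAM ceiling is easy to see with back-of-the-envelope math: the KV cache grows linearly with context length, with K and V tensors stored for every layer. The hyperparameters below (48 layers, 8 KV heads, head dim 128, fp16) are illustrative assumptions, not Gemma 4's published config, but they show why a 256K context can swamp a consumer card on its own.

```python
def kv_cache_gib(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elt=2):
    """Rough KV-cache size: 2 tensors (K and V) per layer,
    each seq_len * n_kv_heads * head_dim elements."""
    total = 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elt
    return total / 1024**3

# Illustrative numbers only -- not Gemma 4's real hyperparameters.
for ctx in (8_192, 131_072, 262_144):
    gib = kv_cache_gib(ctx, n_layers=48, n_kv_heads=8, head_dim=128)
    print(f"{ctx:>7} tokens: {gib:.1f} GiB")
```

Under these assumptions the cache alone hits 48 GiB at 262,144 tokens, before weights or activations, which is why the full advertised context is realistically a multi-GPU or heavily quantized affair on consumer hardware.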