Execution Systems
Google's Gemma 4 Runs on a Single GPU and Changes the Math
7 min read · Published April 4, 2026 · Updated April 4, 2026
By CogLab Editorial Team · Reviewed by Knyckolas Sutherland
Google dropped Gemma 4 on Friday. The spec line that matters: it runs on a single 80GB GPU, an A100 or H100, and scores close enough to models twenty times its size that the gap does not show up on the workloads most operators care about. The license is permissive: you can deploy it, fine-tune it, and ship products on top of it.
For the past year, the practical answer to 'can we self-host' has been no. The models that were good enough required multi-GPU clusters, specialized inference stacks, and operational teams with hyperscale experience. The math almost never worked for a mid-sized company. You were better off sending the requests to a provider and paying the per-token bill.
Gemma 4 changes that math. A single GPU is a unit most serious companies can acquire, rack, and run without a dedicated MLOps team. That means self-hosting is now an option for a much larger share of the market, and 'should we self-host' is a question worth revisiting even if you dismissed it six months ago.
Why is the case for self-hosting getting stronger? Three reasons. Data residency, because regulated industries increasingly need to keep model input and output inside their own environment. Cost control, because per-token pricing punishes heavy usage in ways a flat-rate hardware cost does not. And feature velocity, because self-hosting lets you fine-tune on your own data in ways the API vendors either do not support or charge a premium for.
Why does the case for a hosted API remain strong? Three reasons as well. The frontier moves, and a hosted API gets you the best model automatically the day it ships. The operational cost of running your own model includes people, not just hardware. And capability guarantees from a vendor are real contracts, whereas capability guarantees from a fine-tuned Gemma on your cluster are your own problem.
For an operator looking at this honestly, the right answer is almost always a hybrid. Keep the API for the workloads where you need absolute frontier capability. Run a self-hosted Gemma 4 or similar for workloads where the capability bar is lower and the volume is higher. Cost-model the split. For most mid-sized teams the hybrid saves a meaningful amount of money once monthly API spend gets above about fifty thousand dollars.
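Cost-modeling the split can be sketched in a few lines. This is an illustrative toy model, not the article's figures: the GPU rental price, ops overhead, and per-million-token API rate below are all assumed placeholders you would replace with your own numbers.

```python
# Toy break-even model for API vs. self-hosted serving.
# All dollar figures below are illustrative assumptions, not data from the article.

def monthly_self_host_cost(gpu_monthly: float, ops_monthly: float) -> float:
    """Flat monthly cost: one GPU plus the people/ops overhead to run it."""
    return gpu_monthly + ops_monthly

def monthly_api_cost(tokens_millions: float, price_per_million: float) -> float:
    """Per-token API bill: monthly volume times unit price."""
    return tokens_millions * price_per_million

def break_even_tokens(gpu_monthly: float, ops_monthly: float,
                      price_per_million: float) -> float:
    """Monthly token volume (in millions) where self-hosting matches the API bill."""
    return (gpu_monthly + ops_monthly) / price_per_million

if __name__ == "__main__":
    # Assumed: $4,000/mo for a rented 80GB GPU, $8,000/mo ops overhead,
    # $3 blended price per million API tokens.
    volume = break_even_tokens(4_000, 8_000, 3.0)
    print(f"Break-even volume: {volume:,.0f}M tokens/month")
```

Above the break-even volume, every additional token widens the savings of the flat-rate box; below it, the ops overhead dominates and the API wins.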
The Gemma release also matters for a specific kind of company. If you serve customers who have strong data-residency requirements, say a healthcare provider or a financial institution in a country with data-localization rules, Gemma 4 is now a real option for delivering your product without shipping customer data to a third-party API. That was a hard sell until this release. It is a much easier one today.
There is a second-order effect worth naming. Google is using the Gemma release partly to pressure OpenAI and Anthropic on enterprise pricing. If self-hosted Gemma is a real substitute for a meaningful share of API workloads, the labs have to price their enterprise tiers defensively. You will probably see quiet price moves on commoditized workloads over the next quarter.
The practical move for an operator this week is to set up a Gemma 4 test deployment on one GPU. Run your internal eval set against it. Measure the gap against your current API vendor on the workloads where you care about quality. If the gap is smaller than the cost difference, you have an easy win. If it is larger, you at least have a baseline to compare against as Gemma and its peers improve.
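The measurement step above can be sketched as a small harness. Everything here is hypothetical scaffolding: the two model callables are stand-ins for your actual self-hosted and API clients, and exact-match scoring is the simplest possible metric, used only to make the shape of the comparison concrete.

```python
# Sketch of an eval harness comparing a self-hosted model against an API vendor.
# The `generate` callables are hypothetical stand-ins for real inference clients.
from typing import Callable

def exact_match(prediction: str, reference: str) -> bool:
    """Simplest possible scorer; swap in whatever metric your workload needs."""
    return prediction.strip().lower() == reference.strip().lower()

def score_model(generate: Callable[[str], str],
                eval_set: list[tuple[str, str]]) -> float:
    """Fraction of (prompt, reference) pairs the model answers correctly."""
    hits = sum(exact_match(generate(prompt), ref) for prompt, ref in eval_set)
    return hits / len(eval_set)

def capability_gap(self_hosted: Callable[[str], str],
                   api_vendor: Callable[[str], str],
                   eval_set: list[tuple[str, str]]) -> float:
    """Positive gap means the API vendor still wins on this eval set."""
    return score_model(api_vendor, eval_set) - score_model(self_hosted, eval_set)
```

If `capability_gap` comes back smaller than the cost delta from your break-even model, the workload is a candidate to move; if not, you have a number to re-run each time a new open-weights release lands.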
The broader signal is that the cost of good-enough self-hosted AI is dropping faster than any other part of the stack. That trend favors teams that can actually run their own infrastructure. It puts pressure on the labs to differentiate at the top. And it opens up AI to a much wider range of companies than the API-only market currently serves.
Frequently Asked
Does Gemma 4 really run on one GPU?
Yes, at inference. An 80GB GPU like an H100 or A100 can load the Gemma 4 weights and serve requests. Training or fine-tuning at scale still benefits from multi-GPU setups, but for inference a single GPU is enough for most workloads under a few thousand concurrent users.
Is self-hosting Gemma 4 actually cheaper than an API for my use case?
It depends on volume. The break-even point is usually around fifty thousand dollars a month in API spend. Below that, the operational cost of running your own infrastructure usually overwhelms the per-token savings. Above it, the economics start to favor self-hosting.
What should I use Gemma 4 for specifically?
High-volume, capability-moderate workloads. Classification, summarization, data extraction, internal agents, customer support draft generation, and workflows where frontier-grade reasoning is not the bottleneck. Keep frontier models for the small share of queries that actually need that capability.