NVIDIA's New 120B Model Shows What 'Open' Really Means in 2026
8 min read · Published April 17, 2026 · Updated April 17, 2026
By CogLab Editorial Team · Reviewed by Knyckolas Sutherland
NVIDIA released Nemotron 3 Super this week. The number that caught headlines was 120 billion parameters. The number that matters for anyone actually running this in production is 12 billion active parameters at inference time.
That is the point of the mixture-of-experts design. You load the full 120 billion weights, but for any given token the model only uses about a tenth of them. The math on your GPU bill changes dramatically. You can run frontier-grade output on a single H200 instead of an eight-GPU cluster.
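The cost math above is worth making concrete. Here is a back-of-the-envelope sketch: the parameter counts come from the article, while the per-token cost model (compute scales with active parameters, roughly 2 FLOPs per active parameter per token) is a standard rough estimate, not NVIDIA's published figure.

```python
# Back-of-the-envelope cost math for mixture-of-experts inference.
# Parameter counts are from the article; the cost model is an
# illustrative assumption (per-token compute scales with active params).

TOTAL_PARAMS = 120e9   # must all be resident in GPU memory
ACTIVE_PARAMS = 12e9   # parameters actually used per token

# Rough FLOPs per token for a decoder forward pass: ~2 * active params.
flops_per_token = 2 * ACTIVE_PARAMS
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"~{flops_per_token:.1e} FLOPs/token")      # ~2.4e10
print(f"active fraction: {active_fraction:.0%}")  # 10%

# Upshot: you provision memory for 120B weights, but per-token compute
# is closer to a dense 12B model's, roughly a 10x cut versus running a
# dense 120B model.
```

The asymmetry is the whole trick: memory requirements follow total parameters, but latency and per-token cost follow active parameters.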
Nemotron 3 Super is also hybrid Mamba-Attention. Most of the layers are state-space. A few are attention. That combination lets the model handle long context without the quadratic cost blowup that pure attention suffers on long documents. You get 1 million tokens of context without the memory fireworks.
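A minimal sketch of what that layer mix looks like. The 1-in-6 attention ratio and layer count here are illustrative assumptions, not Nemotron's published layout; the point is the pattern, not the exact numbers.

```python
# Sketch of a hybrid layer stack: mostly state-space (Mamba) layers
# with attention layers interleaved. The 1-in-6 ratio and 48 layers
# are illustrative assumptions, not Nemotron's published architecture.

def build_layer_plan(num_layers: int = 48, attention_every: int = 6) -> list[str]:
    """Return layer types in order: mostly 'mamba', periodically 'attention'."""
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(num_layers)
    ]

plan = build_layer_plan()
print(plan.count("mamba"), plan.count("attention"))  # 40 8

# Why it matters for long context: a state-space layer keeps a
# fixed-size recurrent state, so its memory is O(1) in context length,
# while an attention layer's KV cache grows O(n). With only a few
# attention layers, the O(n) part stays small even at 1M tokens.
```

The design bet is that a handful of attention layers is enough to recover the tasks that genuinely need full pairwise token comparison, while the state-space majority keeps the memory bill flat.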
The license is the other story. NVIDIA published the weights under a permissive license that allows commercial use. You can fine-tune. You can deploy on your own hardware. You can fork it and ship a derivative. This is not a 'research preview' dressed up as open. It is actually the full model, with the full license.
Why does that matter for an operator? Because the practical definition of open has quietly shifted over the past year. A year ago 'open weights' meant you could download a research checkpoint and pray you had enough GPUs to load it. Today it means you can actually run the thing in your own stack, with your own privacy guarantees, for less than a subscription to a comparable closed model.
Nemotron 3 Super benchmarks close to Claude Opus 4.6 on coding evals and to GPT-5.4 on reasoning. It is not the absolute leader. It is the best model you can put behind your own VPC without a service agreement. That is the kind of distinction that matters when you have regulated data, an air-gapped customer, or a strong preference against sending your employee chat logs to a third party.
There is a pattern here that operators should internalize. Every six months, the frontier moves, and the open tier moves up to where the frontier was a cycle ago. Today's open weights roughly match last quarter's closed frontier. That gap is getting shorter, not longer.
What this means for product planning is that you should stop building roadmaps that assume the closed model is the only real option. Any feature you ship today on GPT-5.4 can probably ship on an open model within two quarters, with margins you control instead of margins you rent. If your unit economics depend on inference costs dropping, start a self-hosted pilot now, because the cost floor of running these models yourself is about to drop again.
There is also a quieter implication for vendors. If the open tier is good enough for most enterprise work, the closed labs have to price defensively everywhere except the workloads where they really shine. You see this already in OpenAI's enterprise pricing experiments and in Anthropic's push into domain-specific workflows. The models are not commodities yet, but the cost of switching between them is getting much lower.
The takeaway for operators is simple. Put a real bet on portability. Do not architect your product as a thin wrapper on one vendor's API. Use an abstraction layer that lets you route the same request to a closed model, a self-hosted open model, and a specialized model. Then when Nemotron 4 drops in October, you are ready to run the evaluation instead of planning a migration.
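What that abstraction layer looks like in miniature: one request shape, multiple interchangeable backends. The backend names and functions below are placeholders, not real vendor SDK calls; in practice each backend would wrap a vendor client or your own inference server.

```python
# Sketch of a model-portability layer: one request shape, many backends.
# Backend names and bodies are placeholders, not real SDK APIs.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ChatRequest:
    system: str
    user: str

# A backend is just a function from request to completion text.
Backend = Callable[[ChatRequest], str]

def closed_vendor(req: ChatRequest) -> str:
    return f"[closed-api] {req.user}"      # would call a vendor SDK

def self_hosted_open(req: ChatRequest) -> str:
    return f"[self-hosted] {req.user}"     # would call your own server

BACKENDS: dict[str, Backend] = {
    "closed": closed_vendor,
    "open": self_hosted_open,
}

def route(req: ChatRequest, backend: str) -> str:
    """Send the same request to any registered backend."""
    return BACKENDS[backend](req)

req = ChatRequest(system="Be terse.", user="Summarize Q3 churn.")
print(route(req, "closed"))
print(route(req, "open"))
```

With this shape, an evaluation run against a new model is a one-line registry change plus your existing eval suite, not a rewrite.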
Frequently Asked
What is a hybrid Mamba-Attention model?
A model architecture that mixes state-space layers (Mamba) with traditional attention layers. The state-space parts handle long context efficiently, and the attention parts handle tasks that benefit from full pairwise attention. You get long-context capability without the full attention cost.
Can I actually run a 120B model on one GPU?
With the mixture-of-experts design, yes. The full 120B weights must be resident in memory, which takes about 240GB at fp16. An H200 has 141GB, so fp16 does not fit, but 4-bit quantization cuts the weights to roughly 60GB, and Nemotron 3 Super fits on a single H200. Inference then only touches the roughly 12B parameters of the currently selected experts per token.
How does this change my AI architecture decisions?
It should push you toward portability. Build your agent layer to treat the underlying model as swappable. Test your workflows on both open and closed tiers. Plan for the open tier to be good enough for a growing share of your workloads over the next year.