Claude Opus 4.7 Hits 87.6 on SWE-bench and the Coding Race Gets a New Floor

8 min read · Published April 16, 2026 · Updated April 16, 2026

By CogLab Editorial Team · Reviewed by Knyckolas Sutherland

Anthropic shipped Claude Opus 4.7 today. The headline number is 87.6% on SWE-bench Verified, the benchmark that tests whether a model can solve real open-source bug reports. Six months ago, nothing was past 70%. A year ago, 50% looked ambitious.

Opus 4.7 also posted 94.2% on GPQA, the graduate-level science benchmark built to be Google-proof: the questions are written so the answers cannot simply be looked up. That number used to signal that a model could hold its own against a PhD on their worst day. Now it is table stakes for the top tier.

The spec bump also includes a 1 million token context window and a vision upgrade that operates at 3.3 times the resolution of the prior version. Pricing stayed at five dollars per million input tokens and twenty-five dollars per million output tokens. That is the same cost as Opus 4.6 for a measurably stronger model.

Why does a benchmark update matter to anyone who is not in the AI industry? Because SWE-bench is the closest thing we have to a test of "can this model actually do my engineering team's job." Every point of improvement on that benchmark translates into more real tickets a model can close without a human stepping in to fix its work.

At 87.6%, Opus 4.7 is plausibly closing roughly seven out of every eight simple-to-medium engineering tasks unsupervised. For a shop that is still triaging which internal tools deserve AI help and which do not, that number should prompt a quiet reassessment. The coding agent you evaluated in January and rejected because it could not close tickets cleanly is probably a different product now.
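To make that ratio concrete, here is a back-of-envelope calculation. The 40-tickets-per-week load is an illustrative assumption, not a figure from the release.

```python
# Back-of-envelope: what an 87.6% unsupervised close rate means for a team's
# weekly ticket load. The 40-ticket volume is a made-up illustrative number.
solve_rate = 0.876
weekly_tickets = 40

auto_closed = solve_rate * weekly_tickets   # tickets closed without review
needs_human = weekly_tickets - auto_closed  # tickets escalated to a person

print(f"{auto_closed:.0f} closed unsupervised, {needs_human:.0f} escalated")
# → 35 closed unsupervised, 5 escalated
```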

The price staying flat is the move that actually pressures the market. Anthropic is signaling that model improvements are no longer going to cost more. That is the same move Amazon made on AWS pricing during the middle of cloud's growth curve. Cheaper compute at the same or higher capability keeps the platform growing and keeps competitors scrambling.

If you lead an engineering team, this is a week to re-run your internal evaluation. Take the same ticket set you tried three months ago. Pass it to Opus 4.7 and measure close rate and cycle time. If the numbers are meaningfully different, the decision to fund a real agent rollout stops being a bet on future capability and starts being a math problem about current capability.
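One way to structure that re-run is a small harness that replays the ticket set and records close rate and cycle time. This is a hedged sketch: `run_agent_on_ticket` is a placeholder for however your shop dispatches a ticket to an agent, not a real API.

```python
# Hypothetical re-evaluation harness. run_agent_on_ticket is a stand-in for
# your own dispatch mechanism; it should return True when the agent's patch
# passes CI without human edits.
import time

def evaluate(tickets, run_agent_on_ticket):
    """Replay a ticket set through the agent; report close rate and mean cycle time."""
    closed, durations = 0, []
    for ticket in tickets:
        start = time.monotonic()
        if run_agent_on_ticket(ticket):
            closed += 1
        durations.append(time.monotonic() - start)
    return closed / len(tickets), sum(durations) / len(durations)

# Dry run against canned outcomes instead of a live agent:
outcomes = {"T-101": True, "T-102": True, "T-103": False, "T-104": True}
close_rate, avg_cycle = evaluate(list(outcomes), outcomes.get)
print(f"close rate: {close_rate:.0%}")  # close rate: 75%
```

The dry run at the bottom uses canned outcomes so the harness itself can be sanity-checked before wiring in a live agent.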

The other shift that operators should notice is the context window. One million tokens sounds like a buzzword. What it actually means is that you can fit the whole relevant codebase, your ticket backlog, and the last month of Slack conversation into a single prompt. The agent does not have to guess which of those files is relevant. It can read all of them.

That changes the shape of how you use an agent. You stop writing 'find the bug in this function' prompts. You start writing 'here is our whole repository, here is the failing test, figure out what is wrong.' The model with that much room can do actual investigation work, not just pattern-match on the slice you hand-fed it.
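Here is a sketch of what "here is our whole repository" looks like in practice, assuming a Python repo and a rough four-characters-per-token heuristic. The function name and the budgeting scheme are illustrative, not any vendor's API.

```python
# Illustrative sketch: pack a failing test, an issue description, and as much
# of the repo as fits into a single large-context prompt. The 4-chars-per-token
# heuristic is an assumption, not a tokenizer.
from pathlib import Path

TOKEN_BUDGET = 1_000_000   # headline context size from the release
CHARS_PER_TOKEN = 4        # rough heuristic for English text and code

def build_prompt(repo_root, failing_test, issue_text):
    """Assemble a whole-repo prompt, stopping before the budget overflows."""
    parts = [f"Failing test:\n{failing_test}", f"Issue:\n{issue_text}"]
    budget = TOKEN_BUDGET * CHARS_PER_TOKEN - sum(len(p) for p in parts)
    for path in sorted(Path(repo_root).rglob("*.py")):
        blob = f"\n--- {path} ---\n{path.read_text(errors='ignore')}"
        if len(blob) > budget:
            break  # stop before overflowing the window
        parts.append(blob)
        budget -= len(blob)
    return "\n".join(["Here is our repository. Find what is wrong.", *parts])
```

In practice you would rank files by relevance before truncating, but with a million-token budget many repositories fit whole.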

The risk for operators is the opposite of the one people usually name. The model is rarely the bottleneck. The real bottleneck is that your organization has not built the scaffolding to benefit from a stronger model. If your agent rollout is stuck because nobody has defined what success looks like, nobody owns the evaluation, and nobody has the authority to merge an agent's pull request, a better model will not unstick you. Opus 4.7 just raises the cost of keeping that dysfunction in place.

Frequently Asked

What is SWE-bench Verified and why is 87.6 a big deal?

SWE-bench Verified is a benchmark of real bug reports from open-source repositories. The model has to read the issue, find the right files, and write a patch that passes the existing tests. 87.6% means roughly seven out of every eight attempts succeed end to end, which is close to human engineer performance.

Does pricing staying flat mean models will keep getting better without costing more?

At the top end of the market, roughly yes. The underlying compute cost per token keeps dropping faster than capability gains raise it. The cheap-tier models (Haiku, nano) have been dropping in price even as they improve; the flagship tier has held price while improving.

How should an engineering leader act on this?

Re-run any recent agent evaluation with the new model. Update your assumptions about close rate and cycle time. If the numbers now clear your internal bar, move the conversation from 'should we try this' to 'how do we roll it out safely.'
