Is Claude Opus 4.7 Actually Better, or Just Different?
Anthropic shipped Claude Opus 4.7 on April 16, and a week later the community is still arguing about whether it's actually an upgrade. The benchmarks look great. The real-world experience is more complicated.
The good stuff is real
SWE-bench Pro went from 53.4% on Opus 4.6 to 64.3%. SWE-bench Verified jumped from 80.8% to 87.6%. If you're doing long agentic refactors or letting Claude grind through a multi-file PR, you'll probably feel it. People who practically lived in Opus 4.6 for coding work are mostly happy with the move.
Vision tasks got a noticeable bump too. The new Claude Design feature has been all over X for a reason.
Then it gets weirder
Almost every thread comes back to the same handful of complaints:
- It eats tokens. The most repeated take on Reddit and X is that Opus 4.7 is a "token guzzler." Same task, more tokens, sometimes 1.5–3x more in practice. Even with unchanged list pricing, your bill goes up. (If you'd rather measure than trust vibes, see the first sketch after this list.)
- It takes you literally. This one's actually called out in Anthropic's own migration guide: 4.7 follows instructions much more literally than 4.6. The "ambiguity tax" is real. Prompts that worked fine before, where 4.6 would quietly figure out what you meant, now produce surprising results because 4.7 just does exactly what you said.
- Long-context regression. Needle-in-a-haystack-style retrieval from the middle of large contexts looks worse than 4.6, according to several reports. If your workflow leans on dropping massive context and trusting recall, watch for this. (The second sketch after this list shows a bare-bones way to probe it.)
- The default style shifted. For non-coding work, especially creative writing and exploratory chat, a chunk of users feel the default output is flatter than 4.6's. Some of this might be down to the "Adaptive" vs. explicit thinking-mode change in the web app.
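
If you'd rather measure the token claim on your own workload than trust the Reddit consensus, the Messages API returns a usage block with every response. Here's a minimal sketch, assuming the Anthropic Python SDK; the model ID strings and the sample prompt are placeholders, not confirmed identifiers, so swap in whatever your account's model list actually shows.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder model IDs -- replace with the real ones from your model list.
MODELS = ["claude-opus-4-6", "claude-opus-4-7"]

# Any representative task from your own workload; this one is illustrative.
PROMPT = "Refactor this function to use a dict lookup instead of an if/elif chain: ..."

def usage_for(model: str, prompt: str) -> tuple[int, int]:
    """Send one request and return (input_tokens, output_tokens) from the usage block."""
    resp = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.usage.input_tokens, resp.usage.output_tokens

for model in MODELS:
    tokens_in, tokens_out = usage_for(model, PROMPT)
    print(f"{model}: {tokens_in} in / {tokens_out} out")
```

Output length varies run to run, so average over a handful of runs and a few task types before concluding anything; a single 3x reading could just be sampling noise.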
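The long-context complaint is testable the same way. The sketch below plants one invented fact midway through a pile of filler and asks for it back; the model ID is again a placeholder, and repeated filler is a much softer haystack than the varied text real needle-in-a-haystack suites use, so treat this as the shape of the test rather than a rigorous one.

```python
import anthropic

client = anthropic.Anthropic()

# Repeated filler (roughly 45k tokens) with one made-up fact buried in the middle.
FILLER = "The quick brown fox jumps over the lazy dog. " * 4000
NEEDLE = "The vault code is 7-3-9-1."
half = len(FILLER) // 2
haystack = FILLER[:half] + NEEDLE + " " + FILLER[half:]

resp = client.messages.create(
    model="claude-opus-4-7",  # placeholder ID
    max_tokens=50,
    messages=[{
        "role": "user",
        "content": haystack + "\n\nWhat is the vault code? Answer with the code only.",
    }],
)

answer = resp.content[0].text
print(("retrieved" if "7-3-9-1" in answer else "missed"), "->", answer.strip())
```

Vary the needle position (start, middle, end) and the context size, and run both model versions side by side; the reported regression is specifically mid-context retrieval, so that's the cell to watch.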
So is it nerfed?
Depends entirely on what you use it for.
- Agentic coding, big refactors, vision: probably an upgrade. Vote accordingly.
- Creative writing, loose conversational prompts, long-context recall: plenty of people are saying it feels like a sidegrade or worse.
This is exactly the pattern we keep seeing with model updates. A new version isn't uniformly better or worse. It's better at some things and worse at others, and which category you fall into depends on how you use it. That's also why a single benchmark number never settles the "is it nerfed?" debate.
Help us track it
If you've been using Opus 4.7 for a few days, your gut take is data. Head to Claude's page and vote. The more people weigh in, the cleaner the signal gets, and the easier it is to spot if 4.7's "literal mode" or token appetite quietly shifts week-to-week.
And if it turns out 4.7 isn't doing it for your workflow, check how GPT and Gemini are scoring right now over on nerfedornot.com.