Claude Opus 4.6 Benchmark Results
Claude Opus 4.6 is Anthropic's most capable model, setting new records in coding, reasoning, and knowledge work. It achieves the top score on Terminal-Bench 2.0, the leading benchmark for agentic coding, and leads all frontier models on Humanity's Last Exam, a multidisciplinary reasoning test.
For developers already using Claude Sonnet 4.6 for coding tasks, Opus 4.6 represents the next tier of performance for complex, multi-step agentic work.
Coding Performance: #1 on Terminal-Bench 2.0
Opus 4.6 improves on its predecessor's coding skills in every dimension:
- Careful planning: Plans more thoughtfully before writing code
- Sustained agentic tasks: Maintains context and quality over longer coding sessions
- Large codebase navigation: Operates more reliably in complex, multi-file projects
- Self-correction: Better code review and debugging skills to catch its own mistakes
On Terminal-Bench 2.0, which tests real-world system administration and coding tasks, Opus 4.6 achieves the highest score of any model.
Claude Opus 4.6 vs GPT-5.2 vs Gemini 2.5
| Benchmark | Opus 4.6 | GPT-5.2 | Gemini 2.5 |
|---|---|---|---|
| Terminal-Bench 2.0 | #1 | #2 | #3 |
| Humanity's Last Exam | #1 | #3 | #2 |
| GDPval-AA | #1 (+144 Elo vs GPT-5.2) | #2 | #3 |
| BrowseComp | #1 | #2 | — |
On GDPval-AA, which measures performance on economically valuable knowledge work in finance, legal, and other domains, Opus 4.6 outperforms GPT-5.2 by 144 Elo points and its own predecessor (Opus 4.5) by 190 points.
New Developer Features in Claude Opus 4.6
Agent Teams in Claude Code
You can now assemble agent teams to work on tasks together within Claude Code. Multiple Claude instances collaborate on different parts of a codebase simultaneously, speeding up complex refactoring, feature development, and bug fixing. The same agent teams capability powers Claude Code Security, which uses multiple agents to scan, verify, and validate vulnerabilities.
Compaction for Long-Running Tasks
Claude can now summarize its own context during long-running tasks. This means agentic coding sessions can run much longer without hitting context window limits. For complex, multi-file changes that involve hundreds of tool calls, compaction keeps the session productive without restarting.
Adaptive Thinking
The model picks up on contextual clues about how much extended thinking to apply. For simple questions, it responds quickly. For complex coding problems, it thinks more deeply. Developers also get new effort controls to balance cost, speed, and intelligence per request.
1M Token Context Window
Like Claude Sonnet 4.6, Opus 4.6 features a 1M token context window in beta. This is a first for Opus-class models, enabling processing of entire large codebases in a single request.
Claude Opus 4.6 Pricing and Availability
Opus 4.6 is available on claude.ai, the API (claude-opus-4-6), Amazon Bedrock, and Google Cloud Vertex AI at $5/$25 per million tokens.
Original source
https://www.anthropic.com/news/claude-opus-4-6Frequently Asked Questions
What benchmarks does Claude Opus 4.6 lead?
What are agent teams in Claude Code?
What is compaction in Claude Opus 4.6?
How much does Claude Opus 4.6 cost?
Stay Updated
Get the latest AI news delivered to your inbox.
