Code Velocity
AI Models

Claude Opus 4.6: #1 in Coding and Reasoning Benchmarks

·7 min read·Anthropic, OpenAI·Original source
Share
Claude Opus 4.6 benchmark comparison chart showing #1 rankings on Terminal-Bench 2.0, Humanity's Last Exam, and GDPval-AA

Claude Opus 4.6 Benchmark Results

Claude Opus 4.6 is Anthropic's most capable model, setting new records in coding, reasoning, and knowledge work. It achieves the top score on Terminal-Bench 2.0, the leading benchmark for agentic coding, and leads all frontier models on Humanity's Last Exam, a multidisciplinary reasoning test.

For developers already using Claude Sonnet 4.6 for coding tasks, Opus 4.6 represents the next tier of performance for complex, multi-step agentic work.

Coding Performance: #1 on Terminal-Bench 2.0

Opus 4.6 improves on its predecessor's coding skills in every dimension:

  • Careful planning: Plans more thoughtfully before writing code
  • Sustained agentic tasks: Maintains context and quality over longer coding sessions
  • Large codebase navigation: Operates more reliably in complex, multi-file projects
  • Self-correction: Better code review and debugging skills to catch its own mistakes

On Terminal-Bench 2.0, which tests real-world system administration and coding tasks, Opus 4.6 achieves the highest score of any model.

Claude Opus 4.6 vs GPT-5.2 vs Gemini 2.5

BenchmarkOpus 4.6GPT-5.2Gemini 2.5
Terminal-Bench 2.0#1#2#3
Humanity's Last Exam#1#3#2
GDPval-AA#1 (+144 Elo vs GPT-5.2)#2#3
BrowseComp#1#2

On GDPval-AA, which measures performance on economically valuable knowledge work in finance, legal, and other domains, Opus 4.6 outperforms GPT-5.2 by 144 Elo points and its own predecessor (Opus 4.5) by 190 points.

New Developer Features in Claude Opus 4.6

Agent Teams in Claude Code

You can now assemble agent teams to work on tasks together within Claude Code. Multiple Claude instances collaborate on different parts of a codebase simultaneously, speeding up complex refactoring, feature development, and bug fixing. The same agent teams capability powers Claude Code Security, which uses multiple agents to scan, verify, and validate vulnerabilities.

Compaction for Long-Running Tasks

Claude can now summarize its own context during long-running tasks. This means agentic coding sessions can run much longer without hitting context window limits. For complex, multi-file changes that involve hundreds of tool calls, compaction keeps the session productive without restarting.

Adaptive Thinking

The model picks up on contextual clues about how much extended thinking to apply. For simple questions, it responds quickly. For complex coding problems, it thinks more deeply. Developers also get new effort controls to balance cost, speed, and intelligence per request.

1M Token Context Window

Like Claude Sonnet 4.6, Opus 4.6 features a 1M token context window in beta. This is a first for Opus-class models, enabling processing of entire large codebases in a single request.

Claude Opus 4.6 Pricing and Availability

Opus 4.6 is available on claude.ai, the API (claude-opus-4-6), Amazon Bedrock, and Google Cloud Vertex AI at $5/$25 per million tokens.

Frequently Asked Questions

What benchmarks does Claude Opus 4.6 lead?
Claude Opus 4.6 holds the #1 position on four major benchmarks: Terminal-Bench 2.0 for agentic coding, Humanity's Last Exam for multidisciplinary reasoning, BrowseComp for information retrieval, and GDPval-AA for knowledge work. On GDPval-AA, it outperforms GPT-5.2 by 144 Elo points and its predecessor Opus 4.5 by 190 points. These results make it the highest-scoring frontier model across both coding and reasoning tasks as of February 2026.
What are agent teams in Claude Code?
Agent teams is a new feature in Claude Code that lets multiple Claude instances collaborate on tasks in parallel. For example, one agent can refactor a module while another writes tests and a third updates documentation. This parallel approach speeds up complex codebase changes that would take a single agent much longer. Agent teams launched alongside Opus 4.6 and work with both Opus and Sonnet models.
What is compaction in Claude Opus 4.6?
Compaction is a context management feature that lets Claude summarize its own conversation history during long-running agentic tasks. When a coding session approaches the context window limit, compaction condenses earlier context into a summary so Claude can keep working without losing track of the task. This is especially useful for multi-file refactoring sessions that involve hundreds of tool calls and file reads.
How much does Claude Opus 4.6 cost?
Claude Opus 4.6 costs $5 per million input tokens and $25 per million output tokens, the same pricing as previous Opus models. It is available on claude.ai, the Anthropic API with model ID claude-opus-4-6, Amazon Bedrock, and Google Cloud Vertex AI. For comparison, Claude Sonnet 4.6 offers similar coding quality at $3/$15 per million tokens.

Stay Updated

Get the latest AI news delivered to your inbox.

Share