Robots Atlas>ROBOTS ATLAS
Claude 3.7 Sonnet

Claude 3.7 Sonnet

claude-3-7-sonnet-20250219 · Family: Claude
Claude 3.7 Sonnet is Anthropic's hybrid reasoning model released February 24, 2025. Combines standard mode with extended thinking mode and configurable thinking token budget. Retired from Anthropic API on October 28, 2025.
⚠ Deprecated⏳ Limited accessLLMMultimodalReasoning modelTool-using model📁 Claude
Context window
200K tokenów
tokens
Max output
128,000
tokens
Release date
24 February 2025
Access:APIHostedDeployment:☁ Cloud

Overview

Claude 3.7 Sonnet is a hybrid reasoning model from Anthropic, released on February 24, 2025. API snapshot: claude-3-7-sonnet-20250219. It is the first Anthropic model to combine a standard mode with an extended thinking mode featuring a configurable token budget.

Key features

Context window of 200,000 tokens, with a maximum of 128,000 output tokens (including thinking tokens). Knowledge cutoff: October 31, 2024. Supports tool use, computer use, and multimodal input (text, image, documents). Fine-tuning is not available.

Benchmark results

SWE-bench Verified: 62.3% (pass@1, minimal scaffold) and 70.3% with custom scaffold and test-time compute. GPQA Diamond: 68.0% in standard mode and 84.8% with extended thinking. AIME 2024: 61.3% (standard) / 80.0% (ET). MATH 500: 96.2% (ET). TAU-bench: Retail 81.2%, Airline 58.4%. IFEval: 93.2%. MMLU: 86.1%.

Pricing

$3 USD/MTok input and $15 USD/MTok output (thinking tokens counted as output). Batch API: 50% discount. Prompt caching: cache write $3.75 USD/MTok (5 min TTL), cache read $0.30 USD/MTok.

Safety and status

The model was evaluated at AI Safety Level 2 (ASL-2) under Anthropic's Responsible Scaling Policy. Unnecessary refusals were reduced by 45% (standard) and 31% (extended thinking) relative to Claude 3.5 Sonnet. The model was deprecated in the Anthropic API on October 28, 2025; on Vertex AI it was deprecated on November 11, 2025, with a shutdown date of May 11, 2026 for existing customers.

Classification
LLMMultimodalReasoning modelTool-using model
Family: Claude
Access & deployment
APIHosted
Cloud
Weights: Closed
Key parameters
📏 Context: 200K tokenów
Tools
📥 Input: text, image, documents

Technical specification

Context window
200K tokenów
tokens
Max output tokens
128,000
tokens per response
Knowledge cutoff
31 Oct 2024
Knowledge boundary
License
Proprietary
Hardware requirements
Not applicable / no public data, as the model was not designed for local deployment.
Features:Tool use
Modalities
⬇ Input
textimagedocuments
⬆ Output
textcodestructured_data

Capabilities and applications

Native model capabilities
Reasoning
The model's ability to reason logically and solve complex problems.
Category: reasoning
Multi-step reasoning
Carrying out multi-step chains of reasoning across long, complex tasks.
Category: reasoning
Long context
Maintaining coherence and focus across very long input context.
Category: language
Coding
Generating, analysing and modifying source code.
Category: coding
Function Calling
Category: planning
Structured output
Producing data in structured formats such as JSON.
Category: structured_generation
Image understanding
Analysing and interpreting the content of images.
Category: vision
Chart understanding
Reading and interpreting charts, tables and diagrams.
Category: vision
OCR
Recognising text within images and documents.
Category: vision
Multilingual
Understanding and generating text in many languages.
Category: language
Planning
Forming and executing action plans for complex tasks.
Category: planning
Streaming output
Category: reasoning

Benchmark results

14 benchmarks
GPQA Diamond
78
📅 24 Feb 2025📄 Anthropic – Claude 3.7 Sonnet announcement benchmark table
SWE-bench Verified
62.3
📄 Anthropic – Claude 3.7 Sonnet announcement + appendix
TAU-bench Retail
81.2
📄 Anthropic – Claude 3.7 Sonnet announcement benchmark table
TAU-bench Airline
58.4
📄 Anthropic – Claude 3.7 Sonnet announcement benchmark table
MMMLU
86.1
📄 Anthropic – Claude 3.7 Sonnet announcement benchmark table
MMMU (validation)
75.0
📄 Anthropic – Claude 3.7 Sonnet announcement benchmark table
IFEval
93.2
📄 Anthropic – Claude 3.7 Sonnet announcement benchmark table
MATH 500
96.2
📄 Anthropic – Claude 3.7 Sonnet announcement benchmark table
AIME 2024
61.3
📄 Anthropic – Claude 3.7 Sonnet announcement benchmark table
SWE-bench Verified (high compute, custom scaffold)
accuracy · Multiple parallel attempts, patches failing regression tests are discarded, ranking via scoring model. Subset of 489/500 tasks.
70.3%
📅 24 Feb 2025📄 Anthropic – oficjalny blog claude-3-7-sonnet (anthropic.com/news/claude-3-7-sonnet)
Result achieved with additional test-time compute and parallel sampling — not directly available to API users.
GPQA Diamond (standard mode)
accuracy · Graduate-level questions in Physics, Chemistry, and Biology. Without extended thinking.
68.0%
📅 24 Feb 2025📄 DataCamp – Claude 3.7 Sonnet: Features, Access, Benchmarks (datacamp.com/blog/claude-3-7-sonnet)
With extended thinking: 84.8%. Source: DataCamp, citing Anthropic data.
GPQA Diamond (extended thinking)
accuracy · Graduate-level STEM questions with extended thinking mode enabled.
84.8%
📅 24 Feb 2025📄 DataCamp – Claude 3.7 Sonnet benchmarks (datacamp.com/blog/claude-3-7-sonnet)
The score with extended thinking, combined with additional test-time compute (256 responses + scoring model), is even higher — DataCamp and LessWrong confirm 84.8% as the baseline ET result.
AIME 2024 (extended thinking)
accuracy · American Invitational Mathematics Examination 2024. Extended thinking mode.
80.0%
📅 24 Feb 2025📄 DataCamp – Claude 3.7 Sonnet benchmarks (datacamp.com/blog/claude-3-7-sonnet)
Score of 61.3% in standard mode. An increase of ~18.7 pp. with extended thinking.
MMLU
accuracy · Massive Multitask Language Understanding benchmark covering 57 subject domains.
86.1%
📅 24 Feb 2025📄 Vellum AI – Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1 (vellum.ai/blog/claude-3-7-sonnet-vs-openai-o1-vs-deepseek-r1)
Score from the benchmark table published at launch.

Pricing

Technical architecture

Deployment and security

🔒 Security / Enterprise
✓ Verified enterprise information

Model evaluated under ASL-2 (Anthropic Responsible Scaling Policy). The system card covers CBRN, cybersecurity, autonomy, extended thinking faithfulness, and computer use evaluations. Unnecessary refusals reduced by 45% relative to Claude 3.5 Sonnet.

1) ASL-2 classification assigned following an iterative evaluation process (6 model snapshots, FRT + AST assessments). 2) Unnecessary refusals reduced by 45% (standard mode) and 31% (extended thinking mode) vs. Claude 3.5 Sonnet. 3) Extended thinking chains of thought do not fully reflect the actual reasoning process — disclosed by Anthropic in the system card. 4) Computer use evaluation included analysis of prompt injection attacks. 5) Model retired from the Anthropic API on 28.10.2025. Deprecated on Vertex AI from 11.11.2025 (shutdown 11.05.2026 for existing customers). Researcher access available via the External Researcher Access Program.
Updated: 24 Feb 2025↗ Security documentation