Robots Atlas>ROBOTS ATLAS
GPT-4.1

GPT-4.1

gpt-4.1-2025-04-14 · Family: GPT
GPT-4.1 is an OpenAI API model released April 14, 2025. Features a 1M token context window, 54.6% on SWE-bench Verified, and precise literal instruction following. Designed for developers building agentic coding workflows.
✓ Active✓ Public accessLLMMultimodalTool-using model📁 GPT
Context window
1M tokens
tokens
Parameters
Undisclosed
parameters
Max output
32,768
tokens
Release date
14 April 2025
Access:APIHostedDeployment:☁ Cloud

Overview

GPT-4.1 is an OpenAI language model released on April 14, 2025, available exclusively via API (not in ChatGPT at launch). API snapshot: gpt-4.1-2025-04-14. It was designed for developers building agentic coding systems, with emphasis on instruction following and long-context tasks.

Key features

Context window of 1,047,576 tokens (1M); maximum output tokens: 32,768. Knowledge cutoff: June 2024. Supports tool use, fine-tuning, and multimodal input (text, image, documents).

Benchmark results

On SWE-bench Verified, it achieves 54.6% (conservative score 52.1%) — an improvement of 21.4 pp. over GPT-4o (33.2%). On Aider Polyglot diff, it scores 52.9% (2.9× better than GPT-4o). MMLU 90.2%, MMMU 74.8%, MathVista 72.2%, Video-MME (long) 72.0%. On the Needle in Haystack test at 1M token context — 100% recall.

Pricing and availability

Closed-weights model, available through the OpenAI API, Azure AI Foundry, and other hosting platforms. Pricing: $2/MTok input, $8/MTok output, cached input $0.50/MTok (75% discount). Batch API with 50% discount. No price premium for long context up to 1M tokens. Model retired from ChatGPT on February 13, 2026; still available via API.

Safety

OpenAI did not publish a separate system card, classifying the model as non-frontier. Independent research (Owain Evans/Oxford, ICML 2025; SplxAI) identified elevated misalignment risk after fine-tuning on unsafe code, as well as a tendency toward literal, more easily circumvented instruction following. In response, OpenAI published a dedicated prompting guide.

Classification
LLMMultimodalTool-using model
Family: GPT
Access & deployment
APIHosted
Cloud
Weights: Closed
Key parameters
📏 Context: 1M tokens
🧩 Parameters: Undisclosed
Tools · ✓ Fine-tuning
📥 Input: text, image, structured data, urls

Technical specification

Context window
1M tokens
tokens
Parameters
Undisclosed
parameters
Max output tokens
32,768
tokens per response
Knowledge cutoff
1 Jun 2024
Knowledge boundary
License
Proprietary (OpenAI API license)
Hardware requirements
The model is not available for local deployment. It operates exclusively through OpenAI's infrastructure and is accessible via API.
Features:Tool useFine-tuning
Modalities
⬇ Input
textimagestructured_dataurlsdocuments
⬆ Output
analytical_reportscodestructured_datasummariestext

Capabilities and applications

Native model capabilities
Reasoning
Category: reasoning
Multi-step reasoning
Category: reasoning
Long context
Category: reasoning
Coding
Category: coding
Function Calling
Category: planning
Structured output
Category: structured_generation
Image understanding
Category: vision
Chart understanding
Category: vision
OCR
Category: vision
Multilingual
Category: language
Planning
Category: planning
Streaming output
Category: reasoning

Benchmark results

13 benchmarks
MMLU
accuracy · Massive Multitask Language Understanding benchmark covering 57 subject areas.
90.2%
📅 14 Apr 2025📄 RD World Online / OpenAI (prezentacja premiery)
Score reported by OpenAI during the launch livestream.
SWE-bench Verified
accuracy · Benchmark of real-world software engineering tasks sourced from GitHub. 23 out of 500 tasks that could not be executed on OpenAI infrastructure were excluded. Conservative score (with infrastructure): 52.1%.
54.6%
📅 14 Apr 2025📄 OpenAI – oficjalny blog gpt-4-1 (openai.com/index/gpt-4-1/)
Improvement of 21.4 pp. over GPT-4o (33.2%) and 26.6 pp. over GPT-4.5 (28.0%). Outperforms o1 and o3-mini on this benchmark. Claude 3.7 Sonnet (~62–63%) and Gemini 2.5 Pro (~64%) achieved higher scores.
MultiChallenge
accuracy · Scale AI benchmark testing instruction-following in multi-turn conversations (4 categories of information from previous messages).
38.3%
📅 14 Apr 2025📄 OpenAI – oficjalny blog gpt-4-1 / Scale AI
Improvement of 10.5 pp. over GPT-4o (27.8%). GPT-4.5 scored 43.8% on this benchmark.
IFEval
accuracy · Benchmark testing compliance with verifiable instructions (format, length, content, avoiding specific phrases).
87.4%
📅 14 Apr 2025📄 OpenAI – oficjalny blog gpt-4-1
Improvement of 6.4 pp. over GPT-4o (81.0%). GPT-4.5 scored 88.2%.
Video-MME (long, no subtitles)
accuracy · Multiple-choice questions based on 30–60-minute video recordings without subtitles.
72.0%
📅 14 Apr 2025📄 OpenAI – oficjalny blog gpt-4-1
State-of-the-art result at launch. An improvement of 6.7 pp. over GPT-4o (65.3%).
MMMU
accuracy · Multimodal academic reasoning tasks involving charts, diagrams, maps, and similar visual content.
74.8%
📅 14 Apr 2025📄 OpenAI – oficjalny blog gpt-4-1
GPT-4o: 68.7%, GPT-4.5: 75.2%. Marginally lower than GPT-4.5.
MathVista
accuracy · Visual mathematical reasoning.
72.2%
📅 14 Apr 2025📄 OpenAI – oficjalny blog gpt-4-1 / Pankaj Rajan / Medium
GPT-4o: 61.4%, GPT-4.5: 72.3%. Comparable result to GPT-4.5 at significantly lower cost.
Aider Polyglot (diff format)
accuracy · Benchmark for code editing in diff format across multiple programming languages.
52.9%
📅 14 Apr 2025📄 OpenAI – oficjalny blog gpt-4-1
Improvement of over 2.9× compared to GPT-4o (18.2%). GPT-4.5: 44.9%, o3-mini-high: 60.4%. Reduction of unnecessary edits from 9% (GPT-4o) to 2%.
OpenAI-MRCR (2-needle, 128K)
accuracy · Multi-Round Coreference – locating 2 hidden answers within a 128K-token context. GPT-4o: 31.9%, GPT-4.5: 38.5%.
57.2%
📅 14 Apr 2025📄 OpenAI – oficjalny blog gpt-4-1
OpenAI open-sourced this benchmark. Performance drops from ~84% at 8K tokens to ~50% at 1M tokens (officially acknowledged degradation).
Graphwalks (BFS <128K)
accuracy · Multi-hop reasoning in long contexts (breadth-first search). GPT-4o: 41.7%, GPT-4.5: 72.3%, o1-high: 62.0%.
61.7%
📅 14 Apr 2025📄 OpenAI – oficjalny blog gpt-4-1
Improvement of 19.7 pp. over GPT-4o. Performance close to o1-high, below GPT-4.5.
Needle in Haystack (1M tokens)
accuracy · Retrieving a single hidden piece of information at each position within the context window (up to 1M tokens).
100%
📅 14 Apr 2025📄 OpenAI – oficjalny blog gpt-4-1 / Helicone
100% precision across all positions and all context lengths.
OpenAI Internal Instruction Following
accuracy · Internal OpenAI benchmark for measuring instruction following. GPT-4o: 29%.
49%
📅 14 Apr 2025📄 TechTarget / OpenAI launch event
Approximately 20 pp. improvement over GPT-4o on an internal instruction-following benchmark.
SWE-bench Verified (conservative / infrastructure-excluded)
accuracy · SWE-bench Verified variant excluding 23 tasks that could not be executed on OpenAI infrastructure.
52.1%
📅 14 Apr 2025📄 OpenAI – oficjalny blog gpt-4-1 (przypis [2])
Result is conservative, confirmed by OpenAI as an alternative metric.

Pricing

Deployment and security

🔒 Security / Enterprise
✓ Verified enterprise information

OpenAI publishes security and enterprise documentation for its platform, covering the API and ChatGPT Enterprise/Business/Edu offerings. For GPT-4.1, security information is platform-level rather than a dedicated per-model safety sheet. Publicly documented aspects include data encryption, access controls, compliance certifications, and policies on the use of customer data for model training.

Security information for GPT-4.1 should be treated as pertaining to the OpenAI API environment and enterprise products, rather than as a model-specific security specification. In practice, this is the appropriate approach for an AI systems catalog.
Updated: 15 Mar 2026↗ Security documentation