OpenAI rolls out GPT-5.5 with stronger coding, longer context; steps up rivalry with Anthropic


New model touts major gains in coding, long-context workflows and autonomy as OpenAI targets complex enterprise tasks end-to-end
According to OpenAI, on the consumer AI side, ChatGPT currently has more than 900 million weekly active users and over 50 million subscribers.

OpenAI has launched GPT-5.5, calling it its “smartest and most intuitive to use model yet,” highlighting gains in coding, multi-step task execution and enterprise workflows, with the model positioned to handle “complex, real-world tasks from start to finish.”

According to the company, GPT-5.5 delivers measurable improvements across benchmarks, scoring 84.9% on GDPval, a benchmark for knowledge work; 78.7% on OSWorld, which evaluates real-world computer task execution; and around 98% on telecom workflow tasks in the Tau2 benchmark.

Compared with its predecessor, GPT-5.4, the company said GPT-5.5 matches GPT-5.4’s per-token latency in real-world serving while performing at a much higher level of intelligence. It also uses significantly fewer tokens to complete the same Codex tasks.

On the infrastructure side, the model supports up to 1 million tokens of context in the API and around 400,000 tokens in Codex environments, enabling it to process large documents and sustain longer workflows.

In coding, OpenAI said the model is better at “debugging complex systems, handling multi-file changes, and maintaining consistency across large codebases.” The company added that GPT-5.5 is designed to reduce back-and-forth prompting by carrying tasks through multiple steps.

This puts it in direct competition with Anthropic, which in its own model releases has emphasised Claude’s strength in structured reasoning and long-context reliability. Anthropic focuses on performance in reasoning-heavy benchmarks such as SWE-Bench-style evaluations.

Coding gains, but split persists with Claude on reasoning benchmarks

OpenAI said GPT-5.5 shows improvements in execution-heavy coding environments, particularly where models are required to interact with tools, run commands, and iterate on outputs.

“GPT-5.5 is our most capable model for coding and complex workflows,” the company said, adding that it can “predict failures, identify edge cases, and carry changes across entire systems.”

The model also introduces two configurations: GPT-5.5 Thinking and GPT-5.5 Pro. The latter is focused on higher accuracy and deeper reasoning tasks.

Safety, cybersecurity controls and expanded bug bounty programme

OpenAI said GPT-5.5 includes strengthened safeguards, particularly in cybersecurity-related risk detection and misuse prevention.

“We’ve improved the model’s ability to recognise and refuse harmful or exploitative instructions,” the company noted, adding that GPT-5.5 is designed to limit assistance in high-risk domains such as vulnerability exploitation.

The release is accompanied by updates to OpenAI’s safety processes, including staged deployment and continuous monitoring of model behaviour in sensitive use cases.

In a separate announcement, OpenAI said it is expanding its bug bounty programme to cover newer models, including GPT-5.5. The company is offering financial rewards to external researchers who identify vulnerabilities, as part of efforts to “identify and fix issues before they can be exploited in the real world.”

Pricing of GPT-5.5

OpenAI has priced GPT-5.5 for enterprise-scale usage: $5 per million input tokens, $30 per million output tokens, and up to $180 per million output tokens for the Pro variant.

The company said the model is also more token-efficient than GPT-5.4, allowing it to perform more work with fewer tokens.
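As a rough sketch of how the quoted rates translate into per-request spend, the snippet below computes the cost of a single call at the article’s per-million-token prices. The token counts in the example, and the assumption that the Pro variant shares the standard $5 input rate (the article only quotes its output price), are illustrative, not from the source.

```python
# Rough cost estimator for the GPT-5.5 pricing quoted in the article:
# $5 / 1M input tokens, $30 / 1M output tokens, Pro output up to $180 / 1M.
PRICES = {
    "gpt-5.5":     {"input": 5.00, "output": 30.00},
    "gpt-5.5-pro": {"input": 5.00, "output": 180.00},  # input rate assumed, not quoted
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted per-million-token rates."""
    p = PRICES[model]
    return (input_tokens / 1_000_000) * p["input"] + (output_tokens / 1_000_000) * p["output"]

# Example: a long-context request with 400,000 input and 20,000 output tokens.
print(round(estimate_cost("gpt-5.5", 400_000, 20_000), 2))  # 2.6
```

At these rates, output tokens dominate the bill for generation-heavy workloads, which is why the article’s note on GPT-5.5 using fewer tokens per task matters for cost as well as latency.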

Reaction to GPT-5.5

Senior engineers who tested the model said GPT‑5.5 was “noticeably stronger” than GPT‑5.4 and Claude Opus 4.7 at reasoning and autonomy, catching issues in advance and anticipating testing and review needs without explicit prompting. In one case, an engineer asked it to re-architect a comment system in a collaborative markdown editor and returned to find a 12-diff stack that was nearly complete. One engineer at NVIDIA who had early access to the model went as far as to say: “Losing access to GPT‑5.5 feels like I’ve had a limb amputated.”