TAI #202: GPT-5.5 Moves Codex Into Real Work
Last Updated on April 29, 2026 by the Towards AI Editorial Team. Originally published on Towards AI.

What happened this week in AI by Louie

OpenAI released GPT-5.5 on April 23. In the same week, they launched workspace agents in ChatGPT and released Privacy Filter for PII redaction; Google pushed Deep Research Max and its enterprise agent platform; and DeepSeek released V4-Pro and V4-Flash with 1M-token context. The thread connecting these releases is clear: frontier labs are turning models into work systems, with tools, memory, permissions, pricing, and verification becoming as important as the base model.

The useful read on GPT-5.5 is Codex. OpenAI is aiming the model at complex computer work: writing and debugging code, researching online, analyzing documents and spreadsheets, operating software, and moving across tools until a task is finished. This is the same direction Anthropic has been pushing with Opus, Claude Code, Cowork, Skills, and Claude Design. The competition is increasingly about which lab can package intelligence into a reliable worker.

Codex is still less welcoming than Claude Cowork for non-technical and non-coding work. The interface can feel like a developer tool because, well, it is one. But it is now set up to be extremely valuable for white-collar work far beyond software engineering. You can create reusable skills in a similar spirit to Claude Skills or Cowork workflows, move them between projects, and build task-specific playbooks for research, reporting, data cleanup, document review, financial analysis, or internal operations.

I especially like using Codex subagents. For deep research or parallel workstreams, I often run up to 20 in parallel: some exploring sources, some checking facts, some criticizing the draft, some testing assumptions, and others running iteration loops before anything comes back to me.
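None of this fan-out pattern depends on a specific product. As a minimal sketch of the shape, here is the same idea in plain Python, with hypothetical stub functions standing in for real subagents (the function names and return strings are invented for illustration, not a Codex API):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for Codex subagents. Each takes a task
# description and returns a short finding; in practice these would be
# agents exploring sources, checking facts, or critiquing a draft.
def explore(topic):
    return f"sources for {topic}"

def fact_check(claim):
    return f"verified: {claim}"

def critique(draft):
    return f"critique of {draft}"

def fan_out(draft, topics, claims, max_workers=20):
    """Run specialized workers in parallel and gather every result
    before anything comes back to the orchestrator (you)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = (
            [pool.submit(explore, t) for t in topics]
            + [pool.submit(fact_check, c) for c in claims]
            + [pool.submit(critique, draft)]
        )
        # Collect in submission order so results stay attributable.
        return [f.result() for f in futures]

results = fan_out("draft v1",
                  topics=["context windows", "pricing"],
                  claims=["82.7% on Terminal-Bench"])
```

The point of the sketch is the structure, not the stubs: independent workers run concurrently, and the orchestrator only sees the merged output.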
This is a very different way to use AI: less single-threaded prompting, more managing a small team of specialized workers. If the Codex UI feels overwhelming and you are not a developer, the non-coder view option in advanced settings is worth turning on. It makes the product feel less like a terminal-adjacent engineering cockpit and more like a general work surface.

Codex also works extremely well with OpenAI’s new GPT Images 2.0 model, which is actually beating Gemini’s Nano Banana on some very complex graphics in my tests, and comes with the added advantage of integration into Codex, where subagents can bulk-generate 10–20 images in parallel.

The coding numbers support a real Codex upgrade, with a few caveats. OpenAI reports GPT-5.5 at 82.7% on Terminal-Bench 2.0, up from 75.1% for GPT-5.4 and ahead of Claude Opus 4.7 at 69.4% and Gemini 3.1 Pro at 68.5%. On its internal Expert-SWE eval for long-horizon coding tasks, with a median estimated human completion time of 20 hours, GPT-5.5 scores 73.1% versus 68.5% for GPT-5.4. SWE-Bench Pro is the less flattering number: GPT-5.5 reaches 58.6%, only slightly above GPT-5.4 at 57.7% and behind Opus 4.7 at 64.3%.

That benchmark split is the whole point. GPT-5.5 looks strongest when the task requires terminal work, repo navigation, tool use, long context, and persistence. I would call it a meaningful improvement in the agent loop. For real software work, that is the part that matters. A useful coding agent has to inspect the repo, understand the architecture, run commands, debug failures, preserve user work, and explain the diff. The productivity gain comes from fewer correction loops, less babysitting, and more tasks that reach a reviewable state without the developer restating the obvious five times. This is also why I would measure these tools by accepted PRs, review time, defect rate, and retry count, instead of judging a single impressive answer in chat.

The broader work benchmarks point in the same direction.
GPT-5.5 scores 84.9% wins or ties on GDPval, 78.7% on OSWorld-Verified, 84.4% on BrowseComp, 75.3% on MCP Atlas, and 54.1% on OfficeQA Pro. It has a 400K context window in Codex and a 1,050,000-token context window per the API docs. OpenAI’s long-context results are especially strong on MRCR v2 at 512K to 1M tokens, where GPT-5.5 scores 74.0%, compared with 36.6% for GPT-5.4 and 32.2% for Opus 4.7.

The cost story is more complicated than the launch framing. GPT-5.5 costs $5 per million input tokens, $0.50 per million cached input tokens, and $30 per million output tokens in the API, exactly twice GPT-5.4’s standard token price. GPT-5.5 Pro is listed in the launch post at $30 per million input tokens and $180 per million output tokens. Prompts above 272K input tokens are priced at 2x input and 1.5x output for the full session, and Codex Fast mode generates tokens 1.5x faster for 2.5x the cost.

OpenAI argues GPT-5.5 uses fewer tokens on Codex tasks, and third-party testing broadly supports partial efficiency gains. Artificial Analysis found that GPT-5.5 used about 40% fewer output tokens than GPT-5.4 on its Intelligence Index, making the full run about 20% more expensive rather than 2x as expensive. Teams should test cost per completed workflow (post-human iteration), not cost per token. A GPT-5.5 run that fixes the bug once can be cheaper than four cheap but failed attempts.

OpenAI says more than 85% of its employees now use Codex weekly across engineering, finance, communications, marketing, data science, and product. The Finance team used Codex to review 24,771 K-1 tax forms totaling 71,637 pages, completing the task two weeks faster. The go-to-market team automated weekly business reports, saving 5–10 hours per week. OpenAI also said Codex grew from more than 3 million weekly developers in early April to more than 4 million two weeks later. They also launched Codex Labs and partnerships with Accenture, Capgemini, CGI, Cognizant, Infosys, PwC, and TCS.
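The pricing rules and the cost-per-completed-workflow argument above can be turned into a quick sanity check. The $5/$0.50/$30 rates and the 272K long-prompt surcharge come from the post; the token counts, the retry count, and the cheaper model’s rates are made-up assumptions, and applying the long-prompt multiplier to cached input is a guess rather than something the post confirms:

```python
def session_cost(input_tokens, output_tokens, cached_tokens=0,
                 in_rate=5.0, cached_rate=0.50, out_rate=30.0):
    """Estimate API cost in dollars (rates are $ per 1M tokens).
    Per the launch pricing, prompts above 272K input tokens price the
    full session at 2x input and 1.5x output. Applying the 2x to
    cached input as well is an assumption, not a confirmed detail."""
    in_mult, out_mult = (2.0, 1.5) if input_tokens > 272_000 else (1.0, 1.0)
    fresh = input_tokens - cached_tokens
    return (fresh * in_rate * in_mult
            + cached_tokens * cached_rate * in_mult
            + output_tokens * out_rate * out_mult) / 1_000_000

# Cost per completed workflow, not cost per token (invented counts):
one_good_run = session_cost(50_000, 8_000)  # GPT-5.5 fixes the bug once
four_failed = 4 * session_cost(50_000, 8_000,
                               in_rate=1.25, out_rate=10.0)  # cheaper model, 4 retries
```

On these invented numbers, the single run costs $0.49 while four cheaper retries total $0.57, which is exactly the comparison that a per-token price list hides.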
NVIDIA gives the clearest enterprise rollout pattern. More than 10,000 employees across engineering, product, legal, marketing, finance, sales, HR, operations, and developer programs have used GPT-5.5-powered Codex. NVIDIA reports debugging cycles moving from days to hours and experimentation moving from weeks to overnight progress. The setup is more important than the quote: dedicated cloud virtual machines, auditability, zero-data retention, read-only production access, command-line […]
