OpenAI Introduces GPT-5.2: A Long-Context Workhorse For Agents, Coding And Knowledge Work

OpenAI has just introduced GPT-5.2, its most advanced frontier model for professional work and long-running agents, and is rolling it out across ChatGPT and the API.

GPT-5.2 is a family of three variants. In ChatGPT, users see GPT-5.2 Instant, Thinking, and Pro. In the API, the corresponding models are gpt-5.2-chat-latest, gpt-5.2, and gpt-5.2-pro. Instant targets everyday assistance and learning, Thinking targets complex multi-step work and agents, and Pro allocates more compute for hard technical and analytical tasks.
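
The three API model ids above suggest a simple routing layer in client code. The sketch below is illustrative only: the model ids come from the article, but the selection heuristic (and its thresholds) is an assumption, not OpenAI guidance.

```python
# Illustrative tier selection for the GPT-5.2 family, using the API model
# ids named in the article. The routing heuristic itself is an assumption.
API_MODELS = {
    "instant": "gpt-5.2-chat-latest",  # everyday assistance and learning
    "thinking": "gpt-5.2",             # complex multi-step work and agents
    "pro": "gpt-5.2-pro",              # extra compute for hard technical tasks
}

def pick_model(task: str, needs_tools: bool = False, hard_reasoning: bool = False) -> str:
    """Map a rough task profile onto one of the three GPT-5.2 variants."""
    if hard_reasoning:
        return API_MODELS["pro"]
    if needs_tools or len(task) > 2000:  # long or agentic work goes to Thinking
        return API_MODELS["thinking"]
    return API_MODELS["instant"]

print(pick_model("summarize this email"))             # gpt-5.2-chat-latest
print(pick_model("refactor repo", needs_tools=True))  # gpt-5.2
```

In practice the chosen id would simply be passed as the `model` parameter of an API request; the point here is only that the three variants map one-to-one onto distinct workload profiles.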

Benchmark profile, from GDPval to SWE-Bench

GPT-5.2 Thinking is positioned as the main workhorse for real-world knowledge work. On GDPval, an evaluation of well-specified knowledge tasks across 44 occupations in 9 large industries, it beats or ties top industry professionals on 70.9 percent of comparisons, while producing outputs at more than 11 times the speed and under 1 percent of the estimated expert cost. For engineering teams, this means the model can reliably generate artifacts such as presentations, spreadsheets, schedules, and diagrams given structured instructions.

On an internal benchmark of junior investment banking spreadsheet modeling tasks, average scores rise from 59.1 percent with GPT-5.1 to 68.4 percent with GPT-5.2 Thinking and 71.7 percent with GPT-5.2 Pro. These tasks include three-statement models and leveraged buyout models with constraints on formatting and citations, which is representative of many structured enterprise workflows.

In software engineering, GPT-5.2 Thinking reaches 55.6 percent on SWE-Bench Pro and 80.0 percent on SWE-bench Verified. SWE-Bench Pro evaluates repository level patch generation over multiple languages, while SWE-bench Verified focuses on Python.

Long context and agentic workflows

Long context is a core design target. GPT-5.2 Thinking sets a new state of the art on OpenAI MRCRv2, a benchmark that inserts multiple identical "needle" queries into long dialogue "haystacks" and measures whether the model can reproduce the correct answer. It is the first model reported to reach near 100 percent accuracy on the 4-needle MRCR variant out to 256k tokens.

For workloads that exceed even that context, GPT-5.2 Thinking integrates with the Responses /compact endpoint, which performs context compaction to extend the effective window for tool-heavy, long-running jobs. This is relevant if you are building agents that iteratively call tools over many steps and need to maintain state beyond the raw token limit.
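
The article does not detail how /compact works server-side, but the underlying idea can be sketched locally: once a conversation's estimated token count exceeds a budget, fold the older turns into a summary and keep only the most recent ones verbatim. Everything below, including the character-based token estimate and the stub summarizer, is an assumption for illustration.

```python
# Minimal sketch of context compaction for a long-running agent loop.
# This local version only illustrates the concept behind a compaction
# endpoint; `summarize` stands in for a real model call.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: about 4 characters per token.
    return max(1, len(text) // 4)

def summarize(turns: list[str]) -> str:
    # Stand-in for a model call that compresses old turns into one blob.
    return f"SUMMARY({len(turns)} earlier turns)"

def compact(history: list[str], budget: int, keep_recent: int = 4) -> list[str]:
    """Fold older turns into a summary if the history exceeds `budget` tokens."""
    total = sum(estimate_tokens(t) for t in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent

history = [f"turn {i}: " + "x" * 400 for i in range(20)]
compacted = compact(history, budget=500)
print(len(compacted))  # 5: one summary plus the four most recent turns
```

An agent loop would call something like `compact` between tool steps, so the raw context window bounds only the recent turns plus a summary, not the whole trajectory.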

On tool usage, GPT-5.2 Thinking reaches 98.7 percent on Tau2-bench Telecom, a multi-turn customer support benchmark where the model must orchestrate tool calls across a realistic workflow. The official examples from OpenAI's release post show scenarios like a traveler with a delayed flight, a missed connection, a lost bag, and a medical seating requirement, where GPT-5.2 manages rebooking, special assistance seating, and compensation in a consistent sequence while GPT-5.1 leaves steps unfinished.
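
The scenario above amounts to executing a sequence of tool calls without dropping any step. A minimal dispatch loop in that spirit is sketched below; the tool names and the fixed plan are hypothetical, and in a real agent the model itself would choose the next call rather than follow a hardcoded list.

```python
# Sketch of sequential tool orchestration for the travel-disruption
# scenario. Tool names and the fixed plan are illustrative assumptions.

def rebook_flight(passenger: str) -> str:
    return f"rebooked {passenger} on next available flight"

def assign_assistance_seat(passenger: str) -> str:
    return f"assigned medical assistance seat for {passenger}"

def file_compensation(passenger: str) -> str:
    return f"filed delay compensation claim for {passenger}"

TOOLS = {
    "rebook_flight": rebook_flight,
    "assign_assistance_seat": assign_assistance_seat,
    "file_compensation": file_compensation,
}

def run_plan(plan: list[str], passenger: str) -> list[str]:
    """Execute each named tool in order, returning the transcript of results."""
    return [TOOLS[step](passenger) for step in plan]

log = run_plan(
    ["rebook_flight", "assign_assistance_seat", "file_compensation"],
    "A. Rivera",
)
print(len(log))  # 3 completed steps, none left unfinished
```

Benchmarks like Tau2-bench effectively grade whether the model-produced transcript covers every required step, in a valid order, across many turns.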

Vision, science and math

Vision quality also moves up. GPT-5.2 Thinking roughly halves error rates on chart reasoning and user interface understanding benchmarks like CharXiv Reasoning and ScreenSpot Pro when a Python tool is enabled. The model also shows improved spatial understanding of images: when labeling motherboard components with approximate bounding boxes, GPT-5.2 identifies more regions with tighter placement than GPT-5.1.

For scientific workloads, GPT-5.2 Pro scores 93.2 percent and GPT-5.2 Thinking 92.4 percent on GPQA Diamond, and GPT-5.2 Thinking solves 40.3 percent of FrontierMath Tier 1 to Tier 3 problems with Python tools enabled. These benchmarks cover graduate-level physics, chemistry, and biology, as well as expert mathematics, and OpenAI highlights early use where GPT-5.2 Pro contributed to a proof in statistical learning theory under human verification.

Comparison Table

| Model | Primary positioning | Context window / max output | Knowledge cutoff | Notable benchmarks |
| --- | --- | --- | --- | --- |
| GPT-5.1 | Flagship model for coding and agentic tasks with configurable reasoning effort | 400,000 tokens context, 128,000 max output | 2024-09-30 | SWE-Bench Pro 50.8%, SWE-bench Verified 76.3%, ARC-AGI-1 72.8%, ARC-AGI-2 17.6% |
| GPT-5.2 (Thinking) | New flagship model for coding, agentic tasks across industries, and long-running agents | 400,000 tokens context, 128,000 max output | 2025-08-31 | GDPval wins or ties 70.9% vs industry professionals, SWE-Bench Pro 55.6%, SWE-bench Verified 80.0%, ARC-AGI-1 86.2%, ARC-AGI-2 52.9% |
| GPT-5.2 Pro | Higher-compute version of GPT-5.2 for the hardest reasoning and scientific workloads; produces smarter, more precise responses | 400,000 tokens context, 128,000 max output | 2025-08-31 | GPQA Diamond 93.2% (vs 92.4% for GPT-5.2 Thinking, 88.1% for GPT-5.1 Thinking), ARC-AGI-1 90.5%, ARC-AGI-2 54.2% |

Key Takeaways

  1. GPT-5.2 Thinking is the new default workhorse model: It replaces GPT-5.1 Thinking as the main model for coding, knowledge work, and agents, keeping the same 400k context and 128k max output while delivering clearly higher benchmark performance across GDPval, SWE-Bench, ARC-AGI, and scientific QA.
  2. Substantial accuracy jump over GPT-5.1 at similar scale: On key benchmarks, GPT-5.2 Thinking moves from 50.8 to 55.6 percent on SWE-Bench Pro, from 76.3 to 80.0 percent on SWE-bench Verified, from 72.8 to 86.2 percent on ARC-AGI-1, and from 17.6 to 52.9 percent on ARC-AGI-2, while keeping token limits comparable.
  3. GPT-5.2 Pro is targeted at high-end reasoning and science: GPT-5.2 Pro is a higher-compute variant that mainly improves hard reasoning and scientific tasks, for example reaching 93.2 percent on GPQA Diamond versus 92.4 percent for GPT-5.2 Thinking and 88.1 percent for GPT-5.1 Thinking, along with higher scores on the ARC-AGI tiers.

The post OpenAI Introduces GPT 5.2: A Long Context Workhorse For Agents, Coding And Knowledge Work appeared first on MarkTechPost.