GPT-5.5, Model Cards, And Trustworthy Evaluation
Comparing early GPT-5.5 performance to Claude Opus and arguing for cross-lab safety testing beyond vibes.
When a lab ships a new flagship model, what should you trust more: the vibes on X, your own quick prompts, or the one document that’s supposed to tell you what changed under the hood? With GPT-5.5 (and GPT-5.5-Pro), we’re immediately back in the familiar position of trying to infer capability, safety, and “what it’s good for” from early reactions—while waiting to see what the official paperwork actually reveals.
This post digs into the release through that lens: how GPT-5.5 seems to stack up against Claude Opus in practice, what (if anything) looks different on alignment and agentic risk, and how much confidence you can reasonably extract from OpenAI’s model card and evals. Along the way, it raises a sharper question: if there were a new dangerous capability or alignment regression, would the current testing and disclosure regime even notice—and what would a better, cross-lab “run all the tests” approach look like?
Last week, OpenAI announced GPT-5.5, including GPT-5.5-Pro. The question most people immediately ask is simple: is it actually better, and better for what?
It’s still early and reactions are still coming in, but my overall read is that GPT-5.5 is a solid improvement. More importantly, it feels like it shifts the “which model should I use?” decision from brand preference to task fit.
What GPT-5.5 seems to be good at
In my view, GPT-5.5 is now competitive with Claude Opus for many purposes, especially when the work is straightforward and well-specified. If you know what you want, and you want it cleanly executed, GPT-5.5 looks like a strong default. That covers:
- “Just the facts” queries where you want direct, grounded answers
- Web searches and research-style lookups
- Straightforward requests with clear constraints and an obvious target output
Where Claude Opus still looks like the better bet
My guess is that Claude Opus 4.7 remains the better choice for tasks that benefit from interpretation: fuzzy prompts, ambiguous intent, and work where you want the model to take the initiative in framing the problem.
- Open-ended writing and exploratory thinking
- Interpretive tasks where the “right answer” depends on judgment
- Ambiguous problems where clarifying the question is half the work
A practical “hybrid” approach for coders
If you’re coding, it may be worth thinking less in terms of “pick a winner” and more in terms of workflow. A hybrid approach can make sense: use one model for precision and direct execution, and the other for design-level thinking or creative problem solving.
This doesn’t have to be complicated. It can be as simple as: draft or plan in the model that’s better at open-ended reasoning, then implement and refine in the model that’s better at well-specified requests.
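As a minimal sketch of that plan-then-implement loop, assume each model is nothing more than a function from prompt text to response text. The names `hybrid_workflow`, `stub_planner`, and `stub_implementer` are hypothetical, and the stubs stand in for whatever client calls you actually use; nothing here reflects a real vendor API.

```python
from typing import Callable

# A "model" here is any function from prompt text to response text.
# These are hypothetical stand-ins, not real client-library calls.
Model = Callable[[str], str]


def hybrid_workflow(task: str, planner: Model, implementer: Model) -> dict:
    """Plan with one model, then implement that plan with another."""
    # Step 1: the open-ended model turns a fuzzy task into a concrete plan.
    plan = planner(f"Write a step-by-step implementation plan for: {task}")
    # Step 2: the precise model executes the now well-specified plan.
    code = implementer(f"Implement this plan exactly as written:\n{plan}")
    return {"plan": plan, "code": code}


# Stub models so the sketch runs without network access or API keys.
def stub_planner(prompt: str) -> str:
    return "1. Parse input\n2. Validate fields\n3. Write output"


def stub_implementer(prompt: str) -> str:
    # Echo back everything after the first line of the prompt (the plan).
    return "# implementation based on:\n" + prompt.split("\n", 1)[1]


result = hybrid_workflow("a CSV cleaner", stub_planner, stub_implementer)
```

Swapping which model plays which role is a one-argument change, which makes it easy to test whether the split actually helps on your own tasks.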
Alignment and safety: mostly familiar, with one wrinkle
On the alignment and safety fronts, GPT-5.5 is unlikely to pose major new risks. Its alignment seems similar to that of previous models.
The main caveat is a small additional risk that comes from improved agentic abilities, including computer use. As models get better at taking actions rather than just answering questions, the consequences of mistakes (or misuse) can go up, even if the underlying “personality” of the model hasn’t changed much.
Net: GPT-5.5 looks like a meaningful step forward, and it narrows the gap enough that model choice becomes more about matching the tool to the job than chasing the latest release. If your work is factual, structured, or clearly specified, GPT-5.5 is an easy pick.
If your work is interpretive or open-ended, Claude Opus 4.7 still looks like the better fit. And if you code for a living, you may not need to choose at all—building a simple hybrid workflow can get you the best of both worlds.
As always, when it is available, the system or model card is where we start.
OpenAI does not drop the giant doorstops that Anthropic gives us with every release.
After reading the Mythos and Opus 4.7 model cards, this strikes me as stingy. There’s still good info here, but overall it tells you relatively little about what is going on, and feels incurious and more pro forma.
I would like to see a ‘yes and’ approach to what evaluations are run here, with cooperation between OpenAI and Anthropic (and ideally Google and others), where all labs run all the tests that any lab runs. This would give us a relatively robust set of tests, and also give us comparisons.
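To make the "all labs run all the tests that any lab runs" idea concrete, here is a rough sketch: the shared suite is the union of every lab's evaluations, and each lab's to-do list is whatever it is missing from that union. The lab labels and evaluation names below are invented placeholders, not a description of what any real lab runs.

```python
# Hypothetical data: lab labels and eval names are placeholders only.
evals_by_lab = {
    "Lab A": {"jailbreak_resistance", "bio_uplift"},
    "Lab B": {"jailbreak_resistance", "sabotage", "sycophancy"},
    "Lab C": {"bio_uplift", "cyber_offense"},
}

# The shared suite is the union of everything any lab runs.
shared_suite = set().union(*evals_by_lab.values())

# For each lab, the evals it would need to add under a "yes and" policy.
missing_by_lab = {
    lab: sorted(shared_suite - own) for lab, own in evals_by_lab.items()
}

for lab, missing in missing_by_lab.items():
    print(f"{lab} would add: {', '.join(missing)}")
```

Once every lab runs the full union, every evaluation has a result from every lab, which is what makes direct comparisons possible.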
I notice that if there were new alignment problems, or new dangerous capabilities, I am very not confident that the tests here would pick them up. This is all pretty thin. What I am relying on is the gestalt, including how people are reacting, and in this case the model seems far enough from the edge for that to be conclusive.
GPT-5.5 was trained through the usual methods.
There is a jailbreak bounty program:
We have launched a public bug bounty program that will allow selected (via invitation and application) researchers to submit universal jailbreaks.
Here is its self-portrait: