Biofinysics: May 2026

..this is a living document that I will update as I go....
...I do not guarantee it is finished now or ever will be...
...it is safe to assume I'm open to talking about this and am willing to learn...
..this is version 1.0; 2026-05-01
..previous versions: none

..Status: working draft / living document / corrections welcome

---

GPT models power ChatGPT as well as agentic coding in Codex and other platforms. Copilot is essentially another chatbot inside VS Code that can run a wide range of AI models, including GPT, Claude, Gemini, and others. In other words, Copilot is a layer over those models, and I believe it is an important layer that helps steer, or shepherd, the models toward being better coding collaborators. This was especially true in my experience with GPT-5.4 in Copilot vs Codex. Codex felt like the "Wild West" version of Copilot, even when both were using the same reasoning level. It was not as refined, and it did not always execute everything in my prompt without further prompting.

I did not fully understand at the time how this could be possible. I thought I was going crazy, so I talked to ChatGPT about it. ChatGPT confirmed that the difference is very real: Copilot uses GPT-5.4 with its own heavy set of instructions on how it is supposed to behave, basically like a super eager, capable, amazing worker who does what you ask, follows all the rules, and wants a pat on the head. ChatGPT explained that Codex is not geared up to be like that, so it ends up feeling like working with someone who is super smart but lazy: someone who cuts corners, does the bare minimum, and doesn't really read the instructions.

In my experience between late March and late April 2026, there were marked behavioral differences between Copilot and Codex with GPT-5.4. Codex with GPT-5.4 was great, but did not perform as well as Claude Code (Sonnet). In contrast, Copilot with GPT-5.4 for sure rivaled Claude Code (Sonnet) in at least a few ways, one being that it basically never forgot to do anything it was supposed to do. I could basically always trust it. At a Medium reasoning level it could quickly and accurately implement a spec written by Claude (Sonnet or Opus). At higher reasoning levels it could be used for anything, but again, this was mainly true in the Copilot context. Rivaling or exceeding Claude was a real possibility with the highest reasoning modes of GPT-5.4 in Copilot, and what likely pushed it into a winning position was that this setup was much cheaper than Claude Code. In general, it feels like your money goes farther with Copilot, more so with GPT models than with Claude models. For Claude models the picture was reversed: my opinion around this time was that working with Claude in the Claude Code extension was better than using Claude models in Copilot. None of this is gospel though - I was building these opinions with limited exposure. But the lesson I began to learn around this time went something like:

Claude for Claude, Copilot for GPT.

Importantly, this observation about Copilot vs Codex is almost certainly already "stale".

The GPT-5.5 model came out on April 23rd. I found that Codex with GPT-5.5 substantially closed the gap: it was at least as good as Copilot with GPT-5.4. Copilot did not yet have GPT-5.5 at the time of this writing, so I am wondering whether it will still be able to boost the new GPT model beyond its performance in Codex (in my subjective experience, at least). Codex with GPT-5.5 and Extra High reasoning is for sure rivaling Claude Code now, even at the Opus level.

The gap is narrowing or does not exist. I kept using Claude Code (Sonnet) as my main workhorse, but these observations started forming the basis for how I chose to use different agents: Claude Opus or Codex with GPT-5.5 as a planner and auditor, and Copilot with GPT-5.4 as a major force for faithfully implementing plans worked out by other agents, and even for surfacing interesting issues that the other agents missed. In fact, due to costs, Copilot with GPT-5.4 started becoming my main workhorse even though my aim was for that to be Claude. (And, to be thorough, Gemini CLI was used only for occasional audits around this time.)

Claude Code should 100% be in your tool belt. But yes - I think the lines are getting blurrier and blurrier. So it might really come down to what a platform adds to those models to conform them to desired behaviors. "Steering", or "shepherding", models to be better coding collaborators might be what gives a platform, or an independent developer, the edge. Same model. Different steering. Better results.

I think that is what the Copilot vs Codex lesson taught me: a steering layer above the models is needed, and it can increase the usefulness of a model as it already is.

What I am interested in is developing a system of markdown files that helps shepherd the behavior of all the agents I work with in my "onionskin" repo. I have been doing this by using their "agent files" (such as CLAUDE.md, AGENTS.md, .github/copilot-instructions.md, and GEMINI.md) to direct them to an entry point into a wider set of "agent conventions" across the repo. I will discuss this at length in a future post.
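
To make the "agent files point to one entry point" idea concrete, here is a minimal Python sketch. The four agent file names are the ones listed above; the entry-point path, the stub wording, and the function names are hypothetical placeholders for illustration, not the actual layout of the onionskin repo.

```python
from pathlib import Path

# Agent file names are the ones named in the post.
AGENT_FILES = [
    "CLAUDE.md",
    "AGENTS.md",
    ".github/copilot-instructions.md",
    "GEMINI.md",
]
# Assumption: a made-up location for the shared "agent conventions" entry point.
ENTRY_POINT = "docs/agent-conventions/README.md"


def pointer_text(entry_point: str) -> str:
    """Return the short stub that every agent file carries (wording is illustrative)."""
    return (
        "# Agent instructions\n\n"
        f"Before doing any work, read `{entry_point}` and follow the "
        "conventions it links to.\n"
    )


def write_pointer_files(repo_root: Path, entry_point: str = ENTRY_POINT) -> list[Path]:
    """Write the same stub into each agent file, overwriting any existing content."""
    written = []
    for name in AGENT_FILES:
        path = repo_root / name
        path.parent.mkdir(parents=True, exist_ok=True)  # .github/ may not exist yet
        path.write_text(pointer_text(entry_point), encoding="utf-8")
        written.append(path)
    return written


def files_missing_pointer(repo_root: Path, entry_point: str = ENTRY_POINT) -> list[str]:
    """List agent files that are absent or never mention the entry point."""
    missing = []
    for name in AGENT_FILES:
        path = repo_root / name
        if not path.is_file() or entry_point not in path.read_text(encoding="utf-8"):
            missing.append(name)
    return missing
```

The point of keeping each agent file as a thin pointer is that the conventions live in one place, so every agent is steered by the same text. A check like `files_missing_pointer` could be run occasionally to catch an agent file that has drifted away from the shared entry point.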

Prompt to ChatGPT: Can you make a biblical style picture of a shepherd shepherding a bunch of sheep with AI model names? The shepherd should be meaningfully leading them in a good direction toward something desirable. It should have two frames. The second frame should be a lazy shepherd just sitting on a rock, doom scrolling on his phone as the sheep are just doing whatever they want.
<<The Bible verse may be inaccurate. I did not pick the AI names for each sheep!>>

future looking

1. I want to retest how well Copilot does with the Claude models. I have been strictly using GPT models because I long ago (<2 months) came to the opinion that Copilot was not as good with Claude models as Claude Code, and so just used GPT models to have something “orthogonal” to my Claude Code sessions.

2. I have not been using Cursor at all, and there is a chance Cursor is better than Copilot at steering these models, or whatever it is that it does to improve their usefulness for coding.

3. Antigravity from Google looks very interesting. If it holds up to what I think it might be able to do, then it may offer the highest ROI for anyone who learns to be the orchestrator above the orchestrator AI. I go through a lot of Brainstorm / Spec / Audit / Implementation cycles across 2-4 agents. With Antigravity, from what I understand, I would probably have to map out those cycles and contracts before presenting them to the orchestrator AI, which would then take on the role I have been playing as the orchestrator of multiple agents.
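
As a rough illustration of what "mapping out those cycles and contracts" could look like on paper, here is a small Python sketch. The stage names and their order are taken from the list item above; the agent labels, the contract wording, and the `Step` / `validate_cycle` names are all assumptions, and nothing here reflects how Antigravity itself is configured.

```python
from dataclasses import dataclass

# Stage names and order follow the post; everything else is illustrative.
STAGES = ("brainstorm", "spec", "audit", "implementation")


@dataclass(frozen=True)
class Step:
    stage: str     # one of STAGES
    agent: str     # free-form label for whichever agent is assigned
    contract: str  # what this step must hand to the next one


def validate_cycle(steps: list[Step]) -> list[str]:
    """Return a list of problems; an empty list means the cycle is well formed."""
    problems = []
    stages = [s.stage for s in steps]
    for stage in stages:
        if stage not in STAGES:
            problems.append(f"unknown stage: {stage}")
    # Known stages must appear in the canonical order (repeats are allowed).
    known = [s for s in stages if s in STAGES]
    if known != sorted(known, key=STAGES.index):
        problems.append("stages are out of order")
    for stage in STAGES:
        if stage not in stages:
            problems.append(f"missing stage: {stage}")
    for s in steps:
        if not s.contract.strip():
            problems.append(f"{s.stage} has no contract")
    return problems


# Hypothetical cycle: agent labels are roles, not specific products.
cycle = [
    Step("brainstorm", "planner agent", "a list of candidate approaches"),
    Step("spec", "planner agent", "a written spec with acceptance criteria"),
    Step("audit", "auditor agent", "a list of issues found in the spec"),
    Step("implementation", "implementer agent", "code changes that satisfy the spec"),
]
```

Writing the cycle down as data, even this crudely, forces each hand-off to be stated explicitly, which is presumably the part an orchestrator AI would need spelled out before it could take over the human orchestrator role.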

---

The observations in this post were made between April 16-29 (if not earlier) and are likely to be outdated already. However, I suspect some of the lessons learned will stay relevant.

This blog post was entirely written by me. Not AI at all. Except if you count ChatGPT for making the cartoons and augmented pictures. Then ok - I had help!

Biofinysics

Friday, May 1, 2026

On GPT-5.4 in Copilot and Codex: a steering layer that makes a difference