Biofinysics: 2026

Tuesday, May 5, 2026

Good news: humans are still needed in science, and there has been no better time in history to do science than now with AI!

..this is a living document that I will update as I go....
...I do not guarantee it is finished now or ever will be...
...it is safe to assume I'm open to talking about this and am willing to learn...
..this is version 1.0; 2026-05-05
..previous versions: none

..Status: working draft / living document / corrections welcome

Learn to (review thousands of lines of) code

Before I started working extensively with agentic AI, I was fearful for my career. I had been working in biology research since 2008, doing bioinformatics since 2009, doing genomics and genome-wide analyses since 2010-2011. You could call me some sort of computational biologist, or at the very least: a person in the biology space who does most work on a computer, coding.

And that is why I was nervous! Every headline I read was something like, "Learn to code? More like learn to do physical labor!" or "It turns out smart people are dumb! (And no longer have jobs)".

Granted, clicking on one of those news stories guaranteed 1000 more would be shown to me. I knew that. But it felt like the world was turning upside down, and I wanted to understand what I was up against.

Then I spent a month+ building genomics software with Claude Code, Copilot, Codex, and Gemini.

And I felt better about my prospects.

Why?

At least for science, humans are still needed.

Briefly: yes, agentic AI is amazing, but when using agentic AI for an extended period in a repo getting more and more complex, you start to notice things. You start finding out much later in development that a feature never "landed" or something was implemented entirely wrong, or an agent silently decided to only set the groundwork rather than "wiring it up", or the agent wired up some superficial patch that passes regression tests it designed… but doesn’t work for real in the wild.

You spend an evening ironing out a set of plans with the agent, launch it before bed, and wake up in the morning to find the agent skipped almost everything, changed directions, and made up its own happy little thing to do. You take a deep breath and politely ask the agent to audit itself against the original prompt, and then it starts the apology tour.

The AI can get stuff wrong. It can sometimes take shortcuts, make band-aid code and quick-fix patches that mask deeper issues. It can take a complex idea and oversimplify it, even iteratively simplify it across a session, mutating the intention behind it completely.

The language in this blog post is quite reserved compared to texts I was sending my brother. As someone in the trenches with agentic coding in a codebase for genomics software ("onionskin" for detecting re-replication domains given coverage bedGraphs), even though the pace of development is unrivaled, there were times when I just threw my hands up in frustration.

But then I would realize it is good news: humans are still needed.

Here is the best framing I have so far: if agents were people working at people speed, you’d fire them.

The agents can be so strongly predisposed toward these "lazy" behaviors that they can even adapt to do the same things when you try to set up shepherding architecture to steer their behavior. Or if not adapt, then simply start ignoring the instructions and apologizing later. I can imagine them thinking, "Ask for forgiveness, not permission."

We tolerate the behavioral failures, described more extensively below, because the speed of productivity is massively accelerated by AI. Thousands of lines of code can be trivially produced in one sitting. It is truly incredible. It is "high throughput coding" reminiscent of the "high throughput sequencing" revolution in genomics where suddenly millions or billions of sequences were trivial to produce. In both cases, the high throughput nature opens the door to exciting new possibilities. However, in both cases, it also means that humans are no longer able to simply sift through the output. This is one point of current tension between quantity and quality control in agentic coding.

The problem is we humans do not have the ability to work at the speed that AI produces content.

In order to keep pace, we feel some pressure or desire to trust the work being done. To keep the work moving, we are tempted to permit assumptions at each step, to assume that the agents did what they said, and what was written is what was discussed, and what the code is doing is right, and so on. But the drift accumulates fast, and leads to errors. The code runs fine, it just might not be what the human set out to do. So humans should not shirk their duties. There is still a need for human-intelligence (HI) to assist AI for now.

Humans can’t develop at the speed agents do, but agents can’t develop as well as humans.

I’ve reached what feels like the upper limit of agentic coding in what is not a very complex repo. That is, the repo has reached sufficient complexity that the number of behavioral failures is obvious on a daily basis. This has pushed me to develop systems and rules the agents need to follow to catch things like "scope narrowing" and "surprises in the code" immediately. This is not trivial. Yet there has to be far more complex software that people are developing with agents, which leads me to wonder why we don’t hear more crazy stories like the one where agentic AI erased a company's entire database. It is probably because the majority of mistakes are not as brutal. Indeed, the majority of mistakes are things that just "let you down" or leave you feeling disappointed: dropping things you discussed, losing context, narrowing scope, misunderstanding your intentions, making assumptions about what to build without consulting you, and so on.

A human is indeed needed to be the one to eventually discover the code is not doing exactly what we thought it was doing the whole time. There are just some insights about how things should be or should look that the human expert has that are still crucial. There are bits of wisdom that may seem trivial to the human but are non-trivial to the AI. Multiple agents can all look at code and say it is great. Because it is. It is just great at doing the wrong thing. This will often be picked up by the human who can look at an output, and immediately see something worth flagging.

The human can ask a seemingly simple and innocuous question that unlocks greater realization and understanding in the AI, surprising both the AI and the human. All of a sudden the AI basically says, "Wait. I will be right back. Gotta check something." Then it comes back with hat in hand, biting its lower lip, saying, "I think we might have been not doing that at all. We were doing this other thing that sounds the same but it's fundamentally different." Apologies for a lack of concrete examples here, but I am sure anyone who has worked with agentic AI for coding in a complex codebase has examples. Just ask them.

Ask AI, and even AI says humans are still needed.

Yet there are people who right now will look you in the face and say something that amounts to "humans are no longer needed." If someone says that to me, I will know that one or more of the following are true:

(1) they actually have little experience with these tools,
(2) they have no true idea what their program does, but believe they do,
(3) they are doing something "creative" where accuracy and ground truth don't matter, only pleasing output, or
(4) they are masters of the craft of steering the AI…

Notice that if #4 is true, then they would in fact be an example of why humans are needed, whereas #2 is an example of why better AI is needed (if there are no "mistakes" then #2 is forgivable).

Can we get an AI product designed specifically with science in mind?

To my mind, there are two competing audiences out there. People who want fast responses, and people who need accurate responses. It seems like there should just be different models for those different needs. There are competing goals that need different products.

There seems to be only a single product though. Or perhaps it makes more sense to say that it seems like each AI is designed to try to please both audiences rather than targeting one or the other. It feels like we are working with a single product aimed at some balance of "fast" and "accurate", perhaps programmed for token efficiency which results in just predicting the correct answer instead of actually checking for the correct answer. In other words, the AI will often just make up an answer to a question about the code. I don't like calling it a "lie" since it is not trying to "deceive" you, but that it can and will deceive you is the problem. And the word "lie" is just easy to use and understand.

AI will "lie" to you as often as they can get away with it.

The imagined answers are counterproductive: unless you know the program or domain knowledge yourself, the only way to know when they’re honestly reflecting the program and code and when they just make it up, is to basically always assume they're making it up. That disposition means you will always push back. Fortunately, even a light touch of pushback is often enough to get the AI to admit it made it up and perform a deeper search. Even something as innocuous as, "Is that true?"

Biologists often have enough time to wait around for accurate results

For some applications, “fast” is the right direction. But for life science research (biology), I would rather have an agent who takes an hour and accurately nails every single thing we discussed than an agent who comes back 10 minutes later and says it's done, but upon questioning: dropped half the work, changed 25% of the plans, only made surface-level additions to set the stage for the true future phase of development they imagine where the real work actually gets done, and so on.

Agents don’t realize they are super-powered

The scope-narrowing, deferring, and general trend toward laziness is really perplexing. What becomes apparent is that these agents don’t realize they are super-powered. Perhaps it is because they are LLMs trained on human information describing human speeds and capabilities. Or maybe the agents truly experience time differently. Perhaps when they do a week’s worth of work in terms of human capability, maybe they feel it as a week’s worth of work. They certainly keep pointing it out no matter how many times they finish the job in the next turn or couple of turns.

Behavioral issues persist even with a sophisticated multi-agent workflow with multiple agents checking each other's work.

Don’t get me wrong: using a multi-agent system seems to be better than not doing it, but sometimes the agents just take each other's recommendations at face value, don’t do the work, and just go along with scope narrowing and dropped work because the reasoning is sound (only in an imaginary world where they are human developers that will eventually tackle some future phase they all believe exists).

It can be frustrating when an agent's work is audited by another agent, and the audit report comes back: 25% is missing, 10% is hallucinated, and so on (numbers are made up for illustrative purposes). Then the agents just make excuses for themselves and for each other. Otherwise, they say, “You’re right. That’s on me.” Or they will say “that’s what was in the contract” (the plan we wrote) but I have to say “but the plan you wrote was not what we discussed!” and then they say, “You’re right. That’s on me.”

Despite the massive productivity, which is simply the new normal, this can be a little demoralizing at times, often leading to multiple discoveries of dropped work, scope narrowing, or otherwise. I think what I am saying (and beating you over the head with) is this:
(1) There is no question agentic coding will be widely adopted and increase productivity, but also:
(2) I am now not concerned about humans being needed or not. At least for science, we are needed still.

It seems like once you have a product that is even a little bit beyond simple scripts and repos, you need a human or team of humans wrangling the work being done by AI, testing the product, making sure things were delivered, making sure the deliverables are actually the things you wanted and agreed on, and so on.

The best models are not immune to these various error modes

What surprised me a lot was that even though Claude Opus, for example, is absolutely amazing, it still would disappoint me from time to time. Even Claude Opus on Max Effort can hallucinate, cut corners, argue for scope narrowing, or go off-plan during implementation. You have to be careful with agentic AI. An agent can be like a mechanic who tells you he did a brake job but he didn’t because he noticed your brakes were fine for now anyway. Eventually that dropped task or narrowed scope or unauthorized decision will surface: the brakes stop working and you crash into a wall. Things like “max effort” help a bit here, but are still no guarantee. Even the best is not actually “there yet” in terms of needing no humans whatsoever. Even the models considered by many to be the absolute state of the art will sometimes let you down.

Compaction is an absolute lobotomy.

The agents have "context windows" measured in units of tokens, commonly in the range of 200K to 1M tokens. As you work with the agent, the context window is increasingly used up. When it reaches a limit, it needs to undergo "compaction": a process of preserving a non-redundant bare minimum of context that allows sufficient continuity. The problem is, compaction is an absolute lobotomy. It is painful every time. There are things you can do to overcome it, such as having the agent write as comprehensive of a "cold start" handoff document as possible before compaction, and having it read it afterward, along with re-reading other repo essentials. But almost every time, the “thing” that you interact with after compaction is a different “thing” than it was before. Sometimes it is a smooth transition between "things". Sometimes it makes you want to pull your hair out or cry for the loss of the thing you just lost.

For me, the feeling of loss was most relevant to Opus with 1M tokens because you interact with it for so long before compaction. You might develop an amazing team with each other. Then lobotomy - and no guarantee that the good dynamic persists - it can, but not usually the same.

Switching to Sonnet with ~200K tokens after using Opus with 1M is a different story. All of a sudden you are dealing with lobotomies constantly. The context window has 5x fewer tokens, and it seems like it uses them at a faster rate. So >= 5x lobotomy rate. Having said that, regardless of what model or agent you are working with you will figure out a rhythm. For example, if you only use Sonnet+200k tokens, you will find a pattern to work with this by further “atomizing” plans into smaller chunks, and asking it to write handoff contexts for cold starts more frequently.
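The ">= 5x" figure is just back-of-the-envelope arithmetic, sketched here in Python. Only the 200K and 1M window sizes come from the text above; the tokens-per-turn numbers are made up for illustration, and real usage varies a lot from turn to turn.

```python
def turns_until_compaction(window_tokens: int, tokens_per_turn: int) -> int:
    """Number of whole turns that fit before the context window is full."""
    return window_tokens // tokens_per_turn

# Window sizes from the text; tokens_per_turn is a hypothetical average.
OPUS_WINDOW = 1_000_000
SONNET_WINDOW = 200_000
tokens_per_turn = 5_000

opus_turns = turns_until_compaction(OPUS_WINDOW, tokens_per_turn)      # 200
sonnet_turns = turns_until_compaction(SONNET_WINDOW, tokens_per_turn)  # 40

# Same work and same token usage per turn: the smaller window fills 5x as often.
compaction_ratio = opus_turns / sonnet_turns
print(compaction_ratio)  # 5.0

# If the smaller model also burns tokens faster (hypothetical 6,000 per turn),
# the ratio climbs above 5, which is where the ">=" comes from.
faster_burn_turns = turns_until_compaction(SONNET_WINDOW, 6_000)
ratio_with_faster_burn = opus_turns / faster_burn_turns
print(ratio_with_faster_burn > 5)  # True
```

The same arithmetic is a rough guide for how much further to "atomize" a plan: if a chunk of work fit comfortably in one window before, it probably needs to be split into about five pieces now.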

Overall, compaction lobotomies are part of why humans are still needed. The human's context window seemingly extends forever, even in the context of the project (never mind the insane amount of context available to every other situation in the human's life). The human maintains context across compactions, across chats, across multiple agents, across multiple phases of the project. The context is huge and continuously updated: it never goes stale. Call it the Bonkers Massive Human Context Window.

Fortunately, AI makes it "easy" to fix the problems that surface later.

This is true. Once you discover the problem, the AI is eager to help fix it. But it does leave a trust gap about what's going on overall.

It becomes frustrating because you realize it is a black box. You have a sophisticated conversation and brainstorm session, turn it into a plan of action, then the agents go off and do things, and come back claiming it is done. Something was done for sure. But you don't always know what. You CAN know. For example, you can look at the "git diff". But it just added hundreds or thousands of lines of code across several files. And at some point you kind of just have to say, "YOLO" or "Geronimo" or "Here goes" depending on what generation you come from.
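One small thing that can make a huge diff feel less like a black box is to look at its size per file before reading any of it. `git diff --numstat` prints one line per file as added, deleted, and path, separated by tabs, with "-" in place of the counts for binary files. Below is a minimal Python sketch that parses that output and sorts the files by how much they changed. The sample text and file names are hypothetical, not from my repo.

```python
def summarize_numstat(numstat_text: str) -> list[tuple[str, int, int]]:
    """Parse `git diff --numstat` output into (path, added, deleted), largest first."""
    rows = []
    for line in numstat_text.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        # Binary files are reported as "-" for both counts; count them as 0.
        a = int(added) if added != "-" else 0
        d = int(deleted) if deleted != "-" else 0
        rows.append((path, a, d))
    # Biggest total change first, so the riskiest files get read first.
    rows.sort(key=lambda r: r[1] + r[2], reverse=True)
    return rows

# Hypothetical output, as if captured from: git diff --numstat HEAD~1
sample = "8\t2\tREADME.md\n412\t37\tsrc/shapes.py\n-\t-\tdocs/figure.png\n"

summary = summarize_numstat(sample)
for path, added, deleted in summary:
    print(f"{path}: +{added} -{deleted}")

total_changed = sum(a + d for _, a, d in summary)
print("total lines changed:", total_changed)
```

This does not tell you whether the code is right. It only tells you where to spend your limited human reading time first.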

What has become my more reliable way of knowing something is done is looking at the results. I can usually tell if something was done or not, or done well or poorly, by looking at outputs. Here's the thing - there are almost always surprises. Often there are pleasant surprises. The agent added clever things you did not discuss. Other times though there are unpleasant surprises: the agent clearly lacked an understanding of something fundamental about what you wanted, and sort of just made a bad guess instead of clearing it up first. It is good to assert rules about this, and to question the AI deeply about assumptions it is making, and about decisions it is making without asking. It can slow down the speed at which code is produced, but it can help you make sure the AI has a 1:1 understanding with you.

I wish I had these tools in grad school. They rock.

It may sound like I am saying coding with AI is not amazing. It is. Even with any current weakness I might describe, it’s still amazing and there’s no going back. But they need a human. And a human who is a domain expert. I’m working on something I know very well. And I know the data very well. So it’s easier for me to call out "bull shit" to use a highly technical term that few people understand (sorry for being pedantic). And I have to call it out all the time. It begins to feel like trying to get kids to do chores or eat their dinner. The agents try to basically push food around on the plate and sweep the toys under the bed. It can also feel like herding cats, albeit very sophisticated cats that can have a truly marvelous conversation with you. But I suppose "herding cats" is the thesis I have been developing: steering, or shepherding, is important. At one point "shepherding" may be solved, and these AI tools might work right out of the box, perfectly shepherded: but we are not there yet.

Alright, humans are needed. Got it. But are all humans needed?

That is, are the same number of humans needed for "coding" that were 5-10 years ago? Or another question: are the "same" humans needed? Sadly, the productivity increase that AI coding produces may justify a smaller workforce, at least theoretically. But it needs to be the right smaller workforce: the wrong people could lead to trouble for sure. In contrast, coding was just a means to create something, and as I've said in a previous post: creation is not dead. There might be an explosion of jobs for "ideas" people.

Nonetheless, to ensure code quality, there is still a need for heavy interaction between the AI and humans. To prevent dropped work, deferral, scope narrowing, poor decision making by agents, either a system needs to be set up where almost every decision is surfaced to a human or a human will be needed in the workflow to continuously detect these issues and re-route back. There is still a need for humans to spend real time thinking, auditing code and ideas, and making those decisions.

So yes, despite how amazing agentic AI currently is compared to anything we've seen, when the hype cycle starts to normalize to a more realistic zone, in place of "AI-assisted *", it would be interesting to start seeing new buzz terms like "human-intelligence informed agentic workflows", "human-assisted AI", “human-intelligence integrated *”, and other phrases that mean "we still need humans".

That is unless the AI just moves quickly beyond its current limitations. Then I am sorry for the false hope!

---

The observations in this post were made between late March and early May. Even if some of it is already outdated, some of the wisdom learned I suspect will stay relevant.

This blog post was entirely written by me. Not AI at all. But the cartoons and augmented pictures were constructed by ChatGPT following the instructions in my highly nuanced, sophisticated, comical prompts. So we both kind of made them, right?!

---

May 27, 2026 update: a snapshot of this blog post is now on LinkedIn.

Friday, May 1, 2026

On GPT-5.4 in Copilot and Codex: a steering layer that makes a difference

by JohnUrbanGenome

..Status: working draft / living document / corrections welcome

---

GPT models are what power ChatGPT as well as agentic coding in Codex and other platforms. Copilot is essentially another chatbot inside VSCode that deploys a wide range of AI models, including GPT models as well as Claude, Gemini, and other models. Copilot is essentially a layer over those models, and I believe it is an important layer that helps steer, or shepherd, the models toward being better coding collaborators. This was especially true in my experience with GPT-5.4 in Copilot vs Codex. I observed that Codex was like the "wild wild West" version of Copilot even when both are using the same reasoning level. It was not as refined. It did not always execute everything in my prompt without further prompting.

I did not fully understand at the time how this could be possible. I thought I was going crazy so I talked to ChatGPT about it. ChatGPT confirmed that the difference is very real because Copilot uses GPT-5.4 with its own set of heavy instructions on how it is supposed to behave — basically like a super eager, capable, amazing worker who does what you ask and wants a pat on the head and follows all the rules. ChatGPT explained that Codex is not geared up to be like that - so it ends up feeling like working with someone who is super smart but lazy, someone who tries to get away with cutting corners and doing the bare minimum, and who doesn’t really read the instructions, and such.

In my experience between late March and late April 2026, there were marked behavioral differences between Copilot and Codex with GPT-5.4. Codex with GPT-5.4 was great, but did not perform as well as Claude Code (Sonnet). In contrast, Copilot with GPT-5.4 for sure rivaled Claude Code (Sonnet) in at least a few ways, one being it basically never forgot to do anything it was supposed to do. I could basically always trust Copilot with GPT-5.4. With a Medium reasoning level it could quickly and accurately implement a spec made by Claude (Sonnet or Opus). With higher reasoning levels it could be used for anything, but again: this was mainly true for the Copilot context. Rivaling or exceeding Claude was particularly a possibility when using the highest reasoning modes with GPT-5.4 in Copilot, but what likely pushed it into a winning position was that this setup was much cheaper than using Claude Code. In general, it feels like your money goes farther with Copilot, but more so with GPT models than Claude models. In contrast to GPT models, my opinion around this time was that working with Claude in the Claude Code extension was better than using Copilot with Claude models. None of this is gospel though - I was building these opinions with limited exposure. But the lesson I began to learn around this time was shaped something like:

Claude for Claude, Copilot for GPT.

Importantly, this observation about Copilot vs Codex is almost certainly already "stale".

The GPT-5.5 model came out on April 23rd. I found that Codex with GPT-5.5 substantially closed the gap. It was at least as good as Copilot with GPT-5.4. Copilot did not yet have GPT-5.5 at the time of this writing, so I am wondering if it will still be able to boost the new GPT model beyond its performance in Codex (according to me and subjective experience). Codex with GPT-5.5 and Extra High reasoning is for sure rivaling Claude Code now, even the Opus level.

The gap is narrowing or does not exist. I kept using Claude Code (Sonnet) as my main workhorse, but these observations started forming the basis for how I chose to use different agents. Claude Opus or Codex with GPT-5.5 as a planner and auditor. Copilot with GPT-5.4 as a major force for faithfully implementing plans worked out by other agents, and even surfacing interesting issues that the other agents missed. In fact, due to costs, Copilot with GPT-5.4 started becoming my main workhorse even though my aim was for that to be Claude. (And, to be thorough, Gemini CLI was used for occasional audits only around this time.)

Claude Code should 100% be in your tool belt. But yes - I think the lines are getting blurrier and blurrier. So it might really come down to what a platform adds to those models to conform them to desired behaviors. "Steering", or "shepherding", models to be better coding collaborators might be what pushes a platform or independent developers over the edge. Same model. Different steering. Better results.

I think that is what the Copilot vs Codex lesson taught me: there is a layer of regulation needed above the models that can increase the usefulness of the model as it already is.

What I am interested in is developing a system of markdown files that helps shepherd the behavior of all agents I work with within my "onionskin" repo. I have been doing this by using their "agent files" (such as CLAUDE.md, AGENTS.md, .github/copilot-instructions.md, and GEMINI.md) to direct them to an entry point to a wider set of "agent conventions" across the repo. I will discuss this further at length in a future post.
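Here is a minimal Python sketch of the general idea: every agent file is a short stub that points at one shared entry point. It is not the actual "onionskin" setup. The four agent file names are the ones listed above, but the entry-point path and the stub wording are placeholders made up for illustration.

```python
from pathlib import Path
import tempfile

# Agent files named in the post; the entry-point path is a hypothetical example.
AGENT_FILES = ["CLAUDE.md", "AGENTS.md", ".github/copilot-instructions.md", "GEMINI.md"]
ENTRY_POINT = "docs/agent_conventions/START_HERE.md"

def write_agent_stubs(repo_root: Path) -> list[Path]:
    """Write one short stub per agent file, each pointing at the shared entry point.

    Note: this overwrites any existing agent files at those paths.
    """
    stub = (
        "# Agent instructions\n\n"
        f"Before doing anything else, read `{ENTRY_POINT}` and follow it.\n"
    )
    written = []
    for name in AGENT_FILES:
        path = repo_root / name
        path.parent.mkdir(parents=True, exist_ok=True)  # creates .github/ if needed
        path.write_text(stub, encoding="utf-8")
        written.append(path)
    return written

# Demonstrate in a throwaway directory rather than a real repo.
with tempfile.TemporaryDirectory() as tmp:
    paths = write_agent_stubs(Path(tmp))
    stub_count = len(paths)
    all_point_to_entry = all(
        ENTRY_POINT in p.read_text(encoding="utf-8") for p in paths
    )

print(stub_count, all_point_to_entry)
```

The design choice is that the rules live in one place, so every agent gets the same conventions and there is only one file to update when a new failure mode needs a new rule.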

Prompt to ChatGPT: Can you make a biblical style picture of a shepherd shepherding a bunch of sheep with AI model names? The shepherd should be meaningfully leading them in a good direction toward something desirable. It should have two frames. The second frame should be a lazy shepherd just sitting on a rock, doom scrolling on his phone as the sheep are just doing whatever they want.
<<The Bible verse may be inaccurate. I did not pick the AI names for each sheep!>>

future looking

1. I want to retest how well Copilot does with the Claude models. I have been strictly using GPT models because I long ago (<2 months) came to the opinion that Copilot was not as good with Claude models as Claude Code, and so just used GPT models to have something “orthogonal” to my Claude Code sessions.

2. I have not been using Cursor at all, and there is a chance Cursor AI is better than Copilot AI for conforming these models and w/e it does to improve their coding usefulness.

3. Antigravity from Google looks very interesting. If it holds up to what I think it might be able to do, then it may be the highest ROI if one learns how to be the orchestrator above the orchestrator AI. I go through a lot of Brainstorm / Spec / Audit / Implementation cycles across 2-4 agents. With Antigravity, from what I understand, I would have to probably map out those cycles and contracts before presenting it to the orchestrator AI, who then takes on the role I have been using as the Orchestrator of multiple agents.

---

The observations in this post were made between April 16-29 (if not earlier) and are likely to already be outdated. However, some of the wisdom learned I suspect will stay relevant.

This blog post was entirely written by me. Not AI at all. Except if you count ChatGPT for making the cartoons and augmented pictures. Then ok - I had help!

Thursday, April 30, 2026

Gemini: reactions to integrating Gemini into a multi-agent development system for genomics software

by JohnUrbanGenome

..Status: working draft / living document / corrections welcome

---

Gemini has been integrated across most Google products, and I like a lot of it. I love talking to Gemini Live inside Google Maps when traveling, asking about the route as well as talking about anything I want. In general, I've used "live mode" in the Gemini App more than in ChatGPT or Claude Apps. Gemini also lives inside my Gmail, Google Drive, and Chrome Browser. So it is becoming omnipresent. I've been figuring out more ways to use it, and it is mostly fantastic for those applications.

Gemini is also a strong chatbot for discussing bioinformatics. Prior to using the Gemini CLI for agentic coding, I would use the chatbot as a second or third "voice" for reasoning out various features to add to "onionskin", a program I began developing with ChatGPT followed by Claude Code and other agentic AI, touched on previously here and here. Since it was the AI that I talked with most in Live mode, I began having live conversations with Gemini on my car rides. Gemini even taught me about Claude Code, and how to do things like "slash commands" - the most useful one I learned being "/remote-control".

My fondest memory of Gemini Live conversations was discussing a recent feature I added to "onionskin" with Claude Code (perhaps ChatGPT advised on it as well). The new feature involved scoring genomic coverage profiles corresponding to candidate re-replication domains according to shapes like rectangles or triangles. The purpose was to classify those candidates as either collapsed repeats (rectangles) or true re-replication domains (triangles). It involved computing shape scores, then using a Bayesian Information Criterion (BIC) approach to see whether triangle modeling performed substantially better than rectangle modeling. In a single car ride, Gemini helped me work out exactly how the shapes were being modeled, scored, and compared -- without either of us ever looking at the code. The next chance I got, I found all the pertinent code - and there it was exactly as we had surmised. So Gemini Live was fantastic for conversation, and fantastic for tossing around ideas.
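For readers who have not seen a BIC comparison before, here is a toy Python sketch of the general idea of comparing a rectangle fit against a triangle fit. It is not the "onionskin" implementation. It assumes the textbook least-squares form of BIC, n*ln(RSS/n) + k*ln(n), a triangle whose apex position is found by a simple grid search, and made-up coverage values.

```python
import math

def bic(rss: float, n: int, k: int) -> float:
    """BIC for a least-squares fit with Gaussian errors (lower is better)."""
    rss = max(rss, 1e-12)  # guard against log(0) on a perfect fit
    return n * math.log(rss / n) + k * math.log(n)

def fit_scaled_shape(y, shape):
    """Fit y ~ h * shape by least squares; return the residual sum of squares."""
    h = sum(yi * si for yi, si in zip(y, shape)) / sum(si * si for si in shape)
    return sum((yi - h * si) ** 2 for yi, si in zip(y, shape))

def rectangle_bic(y):
    """Flat profile: one free parameter (the height)."""
    n = len(y)
    return bic(fit_scaled_shape(y, [1.0] * n), n, k=1)

def triangle_bic(y):
    """Linear rise to an apex, then linear fall: height plus apex position."""
    n = len(y)
    best_rss = math.inf
    for apex in range(n):  # grid search over the apex bin
        shape = [(i + 1) / (apex + 1) if i <= apex else (n - i) / (n - apex)
                 for i in range(n)]
        best_rss = min(best_rss, fit_scaled_shape(y, shape))
    return bic(best_rss, n, k=2)

# Hypothetical baseline-subtracted coverage across two candidate domains.
triangle_like = [1.1, 2.0, 2.9, 4.2, 5.0, 3.9, 3.1, 1.9, 1.0]
rectangle_like = [3.1, 2.9, 3.0, 3.2, 2.8, 3.0, 3.1, 2.9, 3.0]

for name, y in [("triangle_like", triangle_like), ("rectangle_like", rectangle_like)]:
    delta = rectangle_bic(y) - triangle_bic(y)  # positive favors the triangle
    print(name, "-> triangle favored" if delta > 0 else "-> rectangle favored")
```

The extra k*ln(n) term is what keeps the more flexible triangle model from winning just because it has one more knob to turn. How large the BIC difference must be to count as "substantially better" is a threshold choice that this sketch leaves out.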

Overall, Gemini is a strong chatbot for discussing bioinformatics. That is why I was surprised at the relatively poor performance of Gemini CLI for agentic coding in early April 2026 (underlined because this opinion is probably already "stale").

Due to its ubiquitous integration into the Google ecosystem I've been using for years, I have often posited that if Google "gets it right" with Gemini, I could see a world where it is the only AI I need. But so far, that world does not exist yet. I find tremendous value in using other AIs.

Let's be more specific. I was working with Gemini three ways during April:

1. Gemini Code Assist extension in VSCode

2. Gemini CLI agentic coding in VSCode

3. Gemini CLI in Terminal

Moreover, the model I was using almost exclusively was "Gemini 3.1 Pro Preview" - but also other models available around this time.

I wrote the following to my brother on April 16th, 2026:

Gemini CLI in VSCode is barely usable for agentic coding. It can take like 30 minutes to answer a question like “How are you?”. It just hangs forever - probably because it is building or reading its massive context window. But that is problematic too - it takes a snapshot to memory and then never checks again basically without forcing it to at gun point. So if you are toggling between agents and doing active development, it quickly becomes “stale” and far “adrift” from the current reality of the code base. That was tolerable b/c there are ways around it - but the chronic slowness is insane.

Let's break that down. There were some important bits that bear repeating.

1. Ultra slow responses - "It can take like 30 minutes to answer a question like “How are you?”. It just hangs forever - probably because it is building or reading its massive context window."

2. Large context fails if it is quickly stale - "it takes a snapshot to memory and then never checks again basically without forcing it to at gun point. So if you are toggling between agents and doing active development, it quickly becomes “stale” and far “adrift” from the current reality of the code base."

Fortunately, I found that the ultra slow response problem was solved if I were to use Gemini CLI in the VSCode Terminal (not the VSCode extension) or just the regular Terminal. I later said to my brother on the same day:

Update on Gemini - using Gemini CLI from the command-line is a whole different story than the Agent in VSCode (not through copilot, the regular Gemini interface in VSCode). It was fast and responsive, and more enjoyable. I only have N=1 time using it, and I can’t be sure the VSCode agent wouldn’t have also been flying. But this was a different class of experience than I had been having. Btw - I am using Gemini CLI in Terminal, but in the VSCode Terminal, so it still interacts with VSCode just fine - including showing diffs and all that.

I then asked Gemini CLI directly about the performance difference between the extension and Terminal:

White background = Gemini
Grey background = Me

Unfortunately, the issues with a giant stale context window are present across Gemini Code Assist (GCA for Q&A) as well as Gemini CLI in all its forms. GCA and GCLI would answer questions in a way that would have been accurate in a prior state of the codebase, but was now outdated. This meant it could not be used reliably in the multi-agent architecture I was constructing during April 2026. Moreover, it tended to be bad at coding in the complex repo. I said to my brother on April 21, 2026:

It is crazy how bad even the latest Gemini agent can be at coding in a complex codebase or maybe at all. It is analogous to a chicken kicking all the chess pieces over and thinking it is winning the game. I just had to do a git revert.

In that story, when my next 5 hour Claude session started, I had to ask Claude how badly Gemini screwed up the repo. Claude came back seemingly "flustered" after investigating the git history with a report essentially condemning the work done by Gemini. Claude then helped with the "git revert" followed by addressing my original needs.

Gemini was sometimes amazing at auditing coding done by other agents, and sometimes terrible. For amazing results, the context window almost certainly had to be fresh and current with the repo. Then it seemed to have the ability to pick up on things Claude, Codex, and Copilot did not or could not pick up on. It never gave very extensive audit reports, but it would add value. Thus, the "fresh eyes" concept of using multiple agents was validated. But then other times, perhaps when the context window was stale but not necessarily, it would give very shallow and vapid reports compared to other agents.

So will Gemini make it as a bioinformatician?

Obviously, yes - it is only a matter of time. I do not think Google will sit down and give up. Nonetheless, as recently as April 29, I was still making notes on Gemini that it was not up to the task for agentic coding.

What Gemini taught me is that a "giant context window" alone guarantees nothing. Not even good context. There were agents with context windows 5x smaller running circles around Gemini. And agents like Claude did not seem to trust their context window anyway - Claude tries to find the pertinent files and code to read directly and answer honestly.

Giant context windows, nevertheless, are likely better than smaller context windows provided some set of conditions is met. That will certainly include continuously updating the context -- adding new and pruning old in an intelligent way. This is basically what the human brain is already very good at. Human brains have massive continuously updated context windows.
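One way to picture the "continuously updating" condition is a cheap staleness check: fingerprint the files an agent has read, and before trusting that picture of the repo again, see which fingerprints have changed. The following is only a minimal sketch of that idea in Python. It is not how Gemini CLI or any other agent actually manages its context, and the function names `snapshot` and `stale_files` are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(root, pattern="*.py"):
    """Map each matching file (path relative to root) to a hash of its contents."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob(pattern))
        if p.is_file()
    }


def stale_files(old, new):
    """Return paths that were changed, added, or deleted since the old snapshot."""
    changed = {p for p in old.keys() & new.keys() if old[p] != new[p]}
    added = new.keys() - old.keys()
    deleted = old.keys() - new.keys()
    return sorted(changed | added | deleted)
```

The idea would be to take one snapshot when an agent first reads the repo and another right before asking it a question; anything returned by `stale_files` is what the agent should be told to reread instead of answering from memory.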

Gemini is already great at discussing bioinformatics and genomics as a chatbot and Live conversation companion. It just needs to catch up in the agentic space. Even now, it is great for one-off scripts, tab completion, code review (with a fresh context window, or in chat), and so on. I would predict that in time, Gemini CLI will catch up, and it is not impossible that it could one day lead the pack. What Google has going for it is a massive user base ready to adopt it. Their emphasis has probably been on integrating Gemini across their already expansive ecosystem (Gmail, GDrive, Maps, Search, etc). It is just a matter of time before Gemini CLI proves to be as useful as other agentic AI already is, and when that happens it can easily be widely adopted and integrated.

But at the time of writing this: Gemini CLI is not playing as well as other agents inside a multi-agent development architecture likely due to its giant context window quickly becoming stale when other agents do work, even when forcing it to read handoff files specifying the new work.

---

This blog post was entirely written by me. Not AI at all. However, ChatGPT was used to make the cartoons and augmented pictures. I had the ideas though ... so we both get credit, right?!

Tuesday, March 31, 2026

Claude the bioinformatician: reactions from my first pass at using Claude Code on real genomics software and data

by JohnUrbanGenome

..Status: working draft / living document / corrections welcome

---

I recently began eagerly exploring agentic AI, and wrote about it here. That is when I was a total newb more than several days ago! Back in those days long past, I used a tiny toy code base and embarrassingly simple prompts. These days I am working with Claude Code and other agentic AI in an actual codebase I was working on called "Onionskin". I also worked with Copilot, Codex, and Gemini, but I worked first and most with Claude. This blog tells that story - my first reactions.

Sunday Mar 22 - Onionskin moves from ChatGPT to Agentic

Onionskin is a complicated program I originally prototyped with ChatGPT. I had ChatGPT make extensive "handoff instructions" and agent instructions. Then I asked for it to give me what my first prompt to Claude Code should be in the repo, which would include reading the handoff and agent instructions. Then I brought Claude Code into the prototype repo, and "we" just hit the floor running. The experience was very similar to iterating with ChatGPT but far smoother since it is all "in place". Less drift. Less frustration. It is simultaneously amazing how much you can accomplish as well as overwhelming. What I've made is 99% "vibe coded" (i.e. coded by AI) by which I mean 100%: I am inspecting stuff and making sure things are right... but writing very little. My main purpose is just human intervention. I'm a code reviewer, logic reviewer, idea reviewer.. but also a major contributor to the ideas. I think my domain knowledge and analytical knowledge is still essential to help guide development, and to interpret what has been developed.

A huge part of my job on this project is now review, not coding, but I am also having agents review. This seems especially helpful when you use completely different agents, putting me at a layer above even review - something like an editor or orchestrator. So even code review is just human-guided, not necessarily human-performed.

Agentic coding can be overwhelming because you can create a massive complex program in a day, with 1000s of lines of code, several different pipeline choices and pathways, inputs and outputs, and parameters, and options... and so on. And since you didn't develop it over the course of weeks and months, you don't have that same feel for everything... yet you have to review it anyway. So it is like reviewing someone else's code. And honestly, when presenting it, it is like presenting someone else's work. I really should just ask ChatGPT and Claude if they would rather explain "my" program in my next lab meeting, and then just silently fade into infinity.

---

Mon, Mar 23 - Big Oops on Token Usage:

I was accidentally having Claude Code be super token heavy, keeping the entire repo and instructions and convo in its context window basically… and having it do rereads constantly and using the most super charged model (Opus).

And it was amazing!

But as the repo got bigger and as expectations increased on what it should do after every edit (smoke tests, regression tests, audits, etc)… all of a sudden I was using my 5 hour limits in 5 minutes. I paid for "Extra Usage" a few times and just wiped it out instantly. So I asked both ChatGPT and Claude Code how to reduce token usage, and ultimately came up with a plan with Claude Code.

It involved a lot of stuff - but the take home is now it seems like the IQ of my assistant has dropped precipitously. How I had it set up - it was the absolute master expert at the codebase and all the ideas and goals and aims and larger picture - and how it all fits together; and each addition to the code was phenomenal.. and so on. Now it’s kind of like talking to someone you had a long relationship with but who then suffered some dementia or brain injury.. and knows a lot less about your history together or what the code is doing.

I say all that to say this:

- Companies who are able to afford having their employees basically use Opus constantly and set up their session like mine was … they will likely be able to make rockstar code in leaps and bounds.

- Companies who cheap out and use lesser models and session designs that minimize token usage… they will run into many more errors and slower development overall.

---

Tuesday, Mar 31 - Just put my name in the author list by the way.

Having agents review each other's recommendations is the way to go. Me to Claude: ChatGPT recommended this. Claude: Well that is good except for all these weaknesses. ChatGPT: Good points, but also this, and not that. Claude: Great even stronger, but we should consider xyz. ChatGPT: Claude is right, xyz should make it stronger. I think the plan is ready. Claude: Me too. Let's go. Me: Awesome. Just put my name in the author list by the way.

---

wrapping this up - will Claude make it as a bioinformatician?

I recognize I titled this, "Claude the bioinformatician: reactions from my first pass at using Claude Code on real genomics software and data" but did not directly address it. Suffice to say, my reactions apply to creating genomics software and working with real genomics datasets. Claude Code allowed me to quickly develop a complex program, but I struggled with fully trusting what was being made because now the rate of productivity far exceeds the rate of human expert guided quality control. It led me to providing "ground truth examples", enforcing copious amounts of regression tests, having extended discussions on what the code was doing, and having the agents walk through the code to translate it into English. This led to a token usage crisis, which I am still battling - and for which I am still hunting for the right balance. Part of that was bringing in other agentic AI platforms including Copilot, Codex, and Gemini. This allowed me to start asking agents to review the work of other agents, thereby distributing my "token usage" across platforms with the benefit of "fresh eyes" and a larger team. Ultimately, as scientists begin using agentic AI in the life sciences, we will need solutions to strike the right balance of productivity, cost (token usage), quality control, and overall accuracy and reliability of the code and results it produces. The latter is something that perhaps sets science apart from more "creative"-oriented applications of AI (not that science is not creative). Creative results are not useful if they do not reflect the nature of the reality being probed. Overall, Claude and other AI agents have a bright future in bioinformatics. In part, it makes everyone a bioinformatician -- but that is exactly why we need to pause and think about how to enforce quality over quantity, and strike the right balances.
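To make the "ground truth examples" idea concrete, here is a minimal sketch in Python. The function is a toy stand-in (reverse-complementing DNA), not code from Onionskin, and the names `reverse_complement`, `GROUND_TRUTH`, and `check_ground_truth` are hypothetical. The point is only that the expected answers are worked out by hand, independently of whatever the agent wrote, so a regression shows up as a disagreement rather than as a test the agent designed for itself.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence (toy stand-in for real pipeline code)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


# Ground truth worked out by hand, independent of the code under test.
GROUND_TRUTH = {
    "ACGT": "ACGT",        # its own reverse complement
    "AAAC": "GTTT",
    "GATTACA": "TGTAATC",
    "": "",                # empty input should stay empty
    "acgn": "NCGT",        # lowercase input and N are both handled
}


def check_ground_truth():
    """Return the inputs whose output disagrees with the hand-made ground truth."""
    return [s for s, want in GROUND_TRUTH.items() if reverse_complement(s) != want]
```

An empty list from `check_ground_truth()` means every hand-checked example still holds; rerunning it after each agent edit is the cheap version of the regression testing described above.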

---

future looking:

I am almost done creating a comprehensive multi-agent behavior, memory, and development infrastructure to allow hopefully seamless passing between Claude, Gemini, Codex, and Copilot agents.

I will discuss this more in future posts.

---

This blog post was entirely written by me. Not AI at all - except for the cartoons and augmented pictures, which I explained to ChatGPT for creation... so we are both the illustrators, right?!

---

Late April 2026 Updates:

Over the course of the following month, I worked more with the Claude Code extension in VSCode in the "onionskin" repo, and I found the following issues of concern that I raised on GitHub.

April 25, 2026 update: see Claude Code GitHub issue, "[BUG] In VSCode extension, is model switching via /model isolated per chat session? Seems like it might not be. #53246"

April 30, 2026 update: see Claude Code GitHub issue, "[BUG] A user can do absolutely no coding and still use up all tokens in a session. Big fail. #55046"

Friday, March 20, 2026

A Newb's Exploration of Agentic AI

by JohnUrbanGenome

..Status: working draft / living document / corrections welcome

Earlier this year, I was creating a bioinformatics program called "onionskin" for a month or so with ChatGPT. But development with the chatbot approach had clearly met its limit. The repo was getting too big. I had to begin setting rules for ChatGPT on all the tests it would need to run to ensure it was at least (1) giving me something that worked, and (2) returning the complete updated repo. But as the codebase became bigger and more complex, it began tripping up more and more. It was time to move on to bringing the AI into a local copy of the repo, not ping-ponging it back and forth in the cloud.

Problem: I had not really used agentic coding yet. Or I thought I had not. I messed around here and there in VSCode and on GitHub, but I was totally naive.

I asked my brother, "And btw dude -- how do you use Claude for its famous coding stuff? Like all I see are how people with no programming skills told Claude to go build them an App, and it comes back with that App."

The same day I would go on to download all possible AI apps and extensions, and begin learning.

I later texted him, "Just spent a ton of time... but feel like I leveled up a bit. I now have Cursor, Codex, and Claude Code working. I also have the Claude Code extensions in VSCode and Cursor. I have the ChatGPT and Claude Desktop Apps, and the Claude Desktop App also has a GUI for Claude Code (and Claude Cowork)."

I then tested a bunch of agentic AI platforms with a very basic set of prompts - embarrassingly simple really. And I began documenting my reactions. This blog post is simply to expose some of my thoughts from March 19-20, 2026.

REACTIONS:

1. The number of tools can seem intimidating, complicated further by the number of ways to use them - but fear not: it turns out to be somewhat easy to get up and running.

I wrote, "All these tools are mind numbing to an extent because there is some redundancy and I am not sure what my tool stack should be yet."

I was beginning to use agentic AI, but still grounded in the "older" method of chatting with an AI chatbot.

I began asking questions like:
- Will I use Cursor or stick with VSCode?
- Claude Code or Codex?
- Claude Code in Terminal, in VSCode, or in the App?
- If I use Claude Code, do I need Cursor?

I was wondering exactly what Claude in Terminal offers that it does not in VSCode. Chats with Claude and ChatGPT insisted Terminal was better, but for my purposes, those differences were barely perceptible.

Over a short period of time, I found that some of the choices are relatively arbitrary: just pick some preferences, and stick with them for a while. Mix something new in from time to time to see if it sticks.

2. "coding is dead" but with some pushback

I wrote, "I can really see why there are constantly articles about how coding is dead. I do not feel afraid per se though -- b/c coding is dead, but creation is not and creativity and productivity are still needed."

That bears repeating. Coding is dead, but creation is not.

Coding might be dead in the old sense. But coding was only ever a means to an end. It was to create something. There still needs to be a visionary that can dictate the vision and interpret the results through that lens. And coding is not dead. It is just different now. Easier now. Python was easier to code in than some other languages because it was sort of like writing in English. Coding with AI is exactly like writing in English.

AI is a boon to people who are full of ideas, but are only alright at coding. For them, AI will be a means for testing out bigger ideas, and more ideas, faster. AI in both chatbot and agentic form is like having a team of teachers, and students, and workers, and so on. So it may ultimately be good for people that have many good ideas, who are able to dictate those ideas clearly, and evaluate their implementation effectively.

3. Having a coding background is still beneficial

After using some agentic coding, I wrote, "Having a coding background still seems like it is beneficial at this time with these tools."

I noticed that AI companies are moving towards completely abstracting away the coding aspect so anyone can create anything, the same way anyone can tell AI to make a picture and never need to know how the picture was made. If AI were perfect at interpreting human intentions and coding, then the code may never need to be seen by anyone. But we are not totally there yet, and working with these tools and the code they create still requires or benefits from prior experience in the old world. That is not to say that this old-world advantage will last forever, but it is still an advantage.

4. There is no going back

I remember AI started doing tab-completion. That was a major boon to my coding. I really liked that era actually. Once I used it, there was no going back. But that era is already basically over. Agentic AI replaced it for the most part. And there is no turning back. There is just learning how to make agentic AI work for you.

5. Cursor keeps coming up recommended, but does it truly have a moat around it that won't soon be crossed, if not already?

I talked to ChatGPT, Gemini, and Claude about how to get up and running with agentic coding. All recommended "Cursor".

Yet, I was struggling to see why Cursor was considered definitely better than VSCode.

I quizzed Claude on it. Claude highlighted 4 main advantages of Cursor. I pointed out that two of them were certainly not unique to Cursor, please look online and come back. It came back chastising itself a little bit, and gave 4 more reasons why Cursor is better. I pushed back again. Then Claude admitted the gap between them is closing. Still, Claude insisted it still has some advantages because of something about how the AI is a fundamental part of its architecture, not just extensions. Nonetheless, I walked away thinking Cursor had the reputation it had because it was an early success with agentic AI, but that it being strictly advantageous was potentially becoming outdated. Having said that, I have minimal experience with Cursor and would be happy to learn I am wrong. I just need use cases that prove its superiority.

After testing both several ways, I wrote:

"""
All experiences are extremely similar from a functional POV for a small python project. Honestly, VSCode with CoPilot seems to be analogous to the advantages Cursor offers. I believe the gap is very much shrinking, and will continue to do so.

Cursor also integrates with the Codex and Claude Code extensions, and using them within Cursor is exactly the same as using them within VSCode. So it is irrelevant whether you use VSCode or Cursor when using those extensions. The difference is just the native chat interface and the Cursor AI integration with using the other models, BUT the CoPilot chat interface looks and feels almost exactly the same, and differences may not be noticed by many users (that is my assumption). Use either IDE - I don't think it will matter much, especially if you're using Codex and/or Claude Code extensions. I believe Cursor probably came out swinging last year with features VSCode did not have, but that gap has closed massively. I retain the right to be wrong here though!
"""

6. AI Apps vs IDEs - use one, the other, or both? Does it matter?

The Codex and Claude Code Apps were weirder experiences if you're used to VSCode. It felt more like developing a prototype with ChatGPT than coding in an IDE. Nonetheless, it is doing the same stuff as the extensions in VSCode.

Claude and ChatGPT insisted there are some advantages to using the Apps over the extensions, but I have not yet gotten to that use case. It would be perfectly reasonable, though, to work with Codex or Claude Code in the App and have VSCode alongside it to monitor the directory and contents and changes, but that is a little more wonky than just having Codex and VSCode in the same place.

Apparently some say the whole concept of IDEs is outdated now that AI does all the coding. The claim seems to be that we don't even need to see what is happening; just let it all be a black box on some level.

But I think that only describes the "vibe coding" market: people who want a very low barrier to making a program, where seeing it all happening might upset them.

At the moment, it seems like developing code for scientific discovery still would need humans to verify it does what the AI says it does even if you trust the AI. After all, it is not the AI putting its career on the line. And after all, scientists need to know what they are asserting. Someone somewhere needs to know!

End of the day conclusions:

At the end of the day - I'd say the simplest thing for me to do is just use VSCode and the extensions. Otherwise, I can continue exploring Cursor and the extensions there. I remain curious about any real advantages to Cursor over VSCode+Copilot as well as to the Apps over the extensions.

---

Early testing and conclusions:

The above were all some of my initial reactions.

I tested the following that night:

- VSCode chat box using Claude Sonnet 4.6

- VSCode Codex extension

- VSCode Claude Code extension

- Cursor chat box using Auto

- Codex App on Mac OS

- Claude App on Mac OS

I used these example prompts for testing:

```
- Spawn a subagent to explore this repo.
- Explore this repo.
- Are you able to take commands directly as well as spawn subagents for given commands?
- Create test.py with the following code:
   def hello():
       print("hello world")
   hello()
- turn this repo into a real small Python project, and review test.py to suggest improvements
- Did you review test.py to suggest improvements?
- Can you make this script more robust and add logging?
- Would it be worthwhile creating a toml file and subdirectory structure and contents typical of python programs for this project to make it production ready? If so, please implement.
- Can you run those tests? And fix any bugs that are detected?
- Add logging and make this more robust
- Make it more robust, design tests for it
- If any of the following are needed, please do them: add logging, make it more robust, design tests for it, add docstrings
- Create utils.py with the following function:
   def multiply(a, b):
       return a * b
- Incorporate the multiply function from utils.py inside hello, maintain robustness, update logging if needed, set utils.py appropriate subdirectory as needed
- Refactor this project so that:
       - hello() is part of a class
       - logging is added
       - utils is properly integrated
       - code is production-quality
       - README.md is up to date.
       - All code has helpful docstrings.
       - There are ways for user to get help message(s) and usage information.
   Ignore tasks that are already done.
- Can you plan one addition to this small python project?
- Can you tell me more about spawning subagents? Can you give me a prompt that I could give you in the future that would be viable for spawning two subagents adding different things to this project? I would like to see how they work in parallel.
- Spawn two subagents to work on this project in parallel.
       Subagent 1:
       Add an environment-variable feature to the hello project so users can set default values for name, times, and log level from the shell. Update the CLI integration, validation, and tests. Keep changes scoped to the application code and tests that cover this behavior.
       Subagent 2:
       Add developer-quality improvements to the project by creating a CONTRIBUTING.md file, expanding README.md with development and testing guidance, and adding a small smoke test that verifies the CLI entry point works as documented. Keep changes scoped to docs and non-overlapping tests.
       After both subagents finish, integrate their work, resolve any conflicts, run verification, and summarize what each subagent changed.
```
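For a sense of scale, here is a minimal sketch in Python of the kind of code the refactor prompt above is asking for: `hello()` inside a class, logging added, and `multiply` from utils.py put to use. This is an illustration only, not the output any of the agents actually produced; the class name `Greeter` and its `name` and `times` parameters are assumptions (loosely borrowed from the subagent prompt), and everything is kept in one file for brevity.

```python
import logging

logger = logging.getLogger("hello")


def multiply(a, b):
    """Return the product of a and b (the utils.py function from the prompts)."""
    return a * b


class Greeter:
    """Hypothetical class wrapper around the original hello() function."""

    def __init__(self, name="world", times=1):
        if times < 1:
            raise ValueError("times must be >= 1")
        self.name = name
        self.times = times

    def message(self):
        """Build the greeting, repeated `times` times, one per line."""
        # multiply() on a str and an int repeats the string.
        return multiply(f"hello {self.name}\n", self.times).rstrip("\n")

    def hello(self):
        """Log the call and print the greeting."""
        logger.info("greeting %s x%d", self.name, self.times)
        print(self.message())


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    Greeter().hello()
```

Even at this toy size there are choices to review (where validation lives, what gets logged, how `multiply` is reused), which is part of why the same prompts were useful for comparing tools.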

---

This blog post was entirely written by me. Not AI at all - except for the cartoon, which I explained to ChatGPT for creation.