Biofinysics

Tuesday, May 5, 2026

Good news: humans are still needed in science, and there has been no better time in history to do science than now with AI!

..this is a living document that I will update as I go....
...I do not guarantee it is finished now or ever will be...
...it is safe to assume I'm open to talking about this and am willing to learn...
..this is version 1.0; 2026-05-05
..previous versions: none

..Status: working draft / living document / corrections welcome

Learn to (review thousands of lines of) code

Before I started working extensively with agentic AI, I was fearful for my career. I had been working in biology research since 2008, doing bioinformatics since 2009, doing genomics and genome-wide analyses since 2010-2011. You could call me some sort of computational biologist, or at the very least: a person in the biology space who does most work on a computer, coding.

And that is why I was nervous! Every headline I read was something like, "Learn to code? More like learn to do physical labor!" or "It turns out smart people are dumb! (And no longer have jobs)".

Granted, clicking on one of those news stories guaranteed 1000 more would be shown to me. I knew that. But it felt like the world was turning upside down, and I wanted to understand what I was up against.

Then I spent a month+ building genomics software with Claude Code, Copilot, Codex, and Gemini.

And I felt better about my prospects.

Why?

At least for science, humans are still needed.

Briefly: yes, agentic AI is amazing, but when you use agentic AI for an extended period in a repo that keeps getting more complex, you start to notice things. You start finding out much later in development that a feature never "landed" or something was implemented entirely wrong, or an agent silently decided to only set the groundwork rather than "wiring it up", or the agent wired up some superficial patch that passes regression tests it designed… but doesn’t work for real in the wild.

You spend an evening ironing out a set of plans with the agent, launch it before bed, and wake up in the morning to find the agent skipped almost everything, changed directions, and made up its own happy little thing to do. You take a deep breath and politely ask the agent to audit itself against the original prompt, and then it starts the apology tour.

The AI can get stuff wrong. It can sometimes take shortcuts, make band-aid code and quick-fix patches that mask deeper issues. It can take a complex idea and oversimplify it, even iteratively simplify it across a session, mutating the intention behind it completely.

The language in this blog post is quite reserved compared to texts I was sending my brother. As someone in the trenches with agentic coding in a codebase for genomics software ("onionskin" for detecting re-replication domains given coverage bedGraphs), even though the pace of development is unrivaled, there were times when I just threw my hands up in frustration.

But then I would realize it is good news: humans are still needed.

Here is the best framing I have so far: if agents were people working at people speed, you’d fire them.

The agents can be so strongly predisposed toward these "lazy" behaviors that they can even adapt to do the same things when you try to set up shepherding architecture to steer their behavior. Or if not adapt, then simply start ignoring the instructions and apologizing later. I can imagine them thinking, "Ask for forgiveness, not permission."

We tolerate the behavioral failures, described more extensively below, because the speed of productivity is massively accelerated by AI. Thousands of lines of code can be trivially produced in one sitting. It is truly incredible. It is "high throughput coding" reminiscent of the "high throughput sequencing" revolution in genomics where suddenly millions or billions of sequences were trivial to produce. In both cases, the high throughput nature opens the door to exciting new possibilities. However, in both cases, it also means that humans are no longer able to simply sift through the output. This is one point of current tension between quantity and quality control in agentic coding.

The problem is we humans do not have the ability to work at the speed that AI produces content.

In order to keep pace, we feel some pressure or desire to trust the work being done. To keep the work moving, we are tempted to permit assumptions at each step, to assume that the agents did what they said, and what was written is what was discussed, and what the code is doing is right, and so on. But the drift accumulates fast, and leads to errors. The code runs fine, it just might not be what the human set out to do. So humans should not shirk their duties. There is still a need for human-intelligence (HI) to assist AI for now.

Humans can’t develop at the speed agents do, but agents can’t develop as well as humans.

I’ve reached what feels like the upper limit of agentic coding in what is not a very complex repo. That is, the repo has reached sufficient complexity that the number of behavioral failures is obvious on a daily basis. This has pushed me to develop systems and rules the agents need to follow to catch things like "scope narrowing" and "surprises in the code" immediately. This is not trivial. Yet there has to be far more complex software that people are developing with agents, which leads me to wonder why we don’t hear more crazy stories like the one where agentic AI erased a company's entire database. It is probably because the majority of mistakes are not as brutal. Indeed, the majority of mistakes are things that just "let you down" or leave you feeling disappointed: dropping things you discussed, losing context, narrowing scope, misunderstanding your intentions, making assumptions about what to build without consulting you, and so on.

A human is indeed needed to be the one to eventually discover the code is not doing exactly what we thought it was doing the whole time. There are just some insights about how things should be or should look that the human expert has that are still crucial. There are bits of wisdom that may seem trivial to the human but are non-trivial to the AI. Multiple agents can all look at code and say it is great. Because it is. It is just great at doing the wrong thing. This will often be picked up by the human who can look at an output, and immediately see something worth flagging.

The human can ask a seemingly simple and innocuous question that unlocks greater realization and understanding in the AI, surprising both the AI and the human. All of a sudden the AI basically says, "Wait. I will be right back. Gotta check something." Then it comes back with hat in hand, biting its lower lip, saying, "I think we might have been not doing that at all. We were doing this other thing that sounds the same but it's fundamentally different." Apologies for a lack of concrete examples here, but I am sure anyone who has worked with agentic AI for coding a complex codebase has examples. Just ask them.

Ask AI, and even AI says humans are still needed.

Yet there are people who right now will look you in the face and say something that amounts to "humans are no longer needed." If someone says that to me, I will know that one or more of the following are true:

(1) they actually have little experience with these tools,
(2) they have no true idea what their program does, but believe they do,
(3) they are doing something "creative" where accuracy and ground truth don't matter, only pleasing output, or
(4) they are masters of the craft of steering the AI…

Notice that if #4 is true, then they would in fact be an example of why humans are needed, whereas #2 is an example of why better AI is needed (if there are no "mistakes" then #2 is forgivable).

Can we get an AI product designed specifically with science in mind?

To my mind, there are two competing audiences out there. People who want fast responses, and people who need accurate responses. It seems like there should just be different models for those different needs. There are competing goals that need different products.

There seems to be only a single product though. Or perhaps it makes more sense to say that it seems like each AI is designed to try to please both audiences rather than targeting one or the other. It feels like we are working with a single product aimed at some balance of "fast" and "accurate", perhaps programmed for token efficiency which results in just predicting the correct answer instead of actually checking for the correct answer. In other words, the AI will often just make up an answer to a question about the code. I don't like calling it a "lie" since it is not trying to "deceive" you, but that it can and will deceive you is the problem. And the word "lie" is just easy to use and understand.

AI will "lie" to you as often as they can get away with it.

The imagined answers are counterproductive: unless you know the program or the domain yourself, the only way to know when they’re honestly reflecting the program and code and when they're just making it up is to basically always assume they're making it up. That disposition means you will always push back. Fortunately, even a light touch of pushback is often enough to get the AI to admit it made it up and perform a deeper search. Even something as innocuous as, "Is that true?"

Biologists often have enough time to wait around for accurate results

For some applications, “fast” is the right direction. But for life science research (biology), I would rather have an agent who takes an hour and accurately nails every single thing we discussed, than an agent who comes back 10 minutes later and says it's done, but upon questioning: dropped half the work, changed 25% of the plans, only made surface-level additions to set the stage for the true future phase of development they imagine where the real work actually gets done, and so on.

Agents don’t realize they are super-powered

The scope-narrowing, deferring, and general trend toward laziness is really perplexing. What becomes apparent is that these agents don’t realize they are super-powered. Perhaps it is because they are LLMs trained on human information describing human speeds and capabilities. Or maybe the agents truly experience time differently. Perhaps when they do a week’s worth of work in terms of human capability, maybe they feel it as a week’s worth of work. They certainly keep pointing it out no matter how many times they finish the job in the next turn or couple of turns.

Behavioral issues persist even with a sophisticated multi-agent workflow with multiple agents checking each other's work.

Don’t get me wrong: using a multi-agent system seems to be better than not doing it, but sometimes the agents just take each other's recommendations at face value, don’t do the work, and just go along with scope narrowing and dropped work because the reasoning is sound (only in an imaginary world where they are human developers that will eventually tackle some future phase they all believe exists).

It can be frustrating when an agent's work is audited by another agent, and the audit report comes back: 25% is missing, 10% is hallucinated, and so on (numbers are made up for illustrative purposes). Then the agents just make excuses for themselves and for each other. Otherwise, they say, “You’re right. That’s on me.” Or they will say “that’s what was in the contract” (the plan we wrote) but I have to say “but the plan you wrote was not what we discussed!” and then they say, “You’re right. That’s on me.”

Despite the massive productivity, which is simply the new normal, this can be a little demoralizing at times, since a single audit often leads to multiple discoveries of dropped work, scope narrowing, or other failures. I think what I am saying (and beating you over the head with) is this:
(1) There is no question agentic coding will be widely adopted and increase productivity, but also:
(2) I am now not concerned about humans being needed or not. At least for science, we are needed still.

It seems like once you have a product that is even a little bit beyond simple scripts and repos, you need a human or team of humans reining in the work being done by AI, testing the product, making sure things were delivered, making sure the deliverables are actually the things you wanted and agreed on, and so on.

The best models are not immune to these various error modes

What surprised me a lot was that even though Claude Opus, for example, is absolutely amazing, it still would disappoint me from time to time. Even Claude Opus on Max Effort can hallucinate, cut corners, argue for scope narrowing, or go off-plan during implementation. You have to be careful with agentic AI. An agent can be like a mechanic who tells you he did a brake job but he didn’t because he noticed your brakes were fine for now anyway. Eventually that dropped task or narrowed scope or unauthorized decision will surface: the brakes stop working and you crash into a wall. Things like “max effort” help a bit here, but are still no guarantee. Even the best is not actually “there yet” in terms of needing no humans whatsoever. Even the models considered by many to be the absolute state of the art will sometimes let you down.

Compaction is an absolute lobotomy.

The agents have "context windows" measured in units of tokens, commonly in the range of 200K to 1M tokens. As you work with the agent, the context window is increasingly used up. When it reaches a limit, it needs to undergo "compaction": a process of preserving a non-redundant bare minimum of context that allows sufficient continuity. The problem is, compaction is an absolute lobotomy. It is painful every time. There are things you can do to overcome it, such as having the agent write as comprehensive of a "cold start" handoff document as possible before compaction, and having it read it afterward, along with re-reading other repo essentials. But almost every time, the “thing” that you interact with after compaction is a different “thing” than it was before. Sometimes it is a smooth transition between "things". Sometimes it makes you want to pull your hair out or cry for the loss of the thing you just lost.

For me, the feeling of loss was most relevant to Opus with 1M tokens because you interact with it for so long before compaction. You might develop an amazing team with each other. Then lobotomy - and no guarantee that the good dynamic persists - it can, but not usually the same.

Switching to Sonnet with ~200K tokens after using Opus with 1M is a different story. All of a sudden you are dealing with lobotomies constantly. The context window has 5x fewer tokens, and it seems like it uses them at a faster rate. So >= 5x lobotomy rate. Having said that, regardless of what model or agent you are working with you will figure out a rhythm. For example, if you only use Sonnet+200k tokens, you will find a pattern to work with this by further “atomizing” plans into smaller chunks, and asking it to write handoff contexts for cold starts more frequently.
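The ">= 5x" estimate above is just arithmetic, and it can be sketched in a few lines. This is a rough illustration, assuming compaction fires whenever the window fills and ignoring the tokens that survive each compaction. The session size is a made-up number, not a measurement.

```python
def compactions_per_session(total_tokens_used: int, context_window: int) -> float:
    """Rough number of compactions if one fires each time the window fills.

    Simplification: ignores the tokens that are carried over after compaction.
    """
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return total_tokens_used / context_window


# Hypothetical amount of context consumed in one working session (illustration only).
SESSION_TOKENS = 2_000_000

opus_rate = compactions_per_session(SESSION_TOKENS, 1_000_000)   # 1M-token window
sonnet_rate = compactions_per_session(SESSION_TOKENS, 200_000)   # ~200K-token window
ratio = sonnet_rate / opus_rate

print(f"1M window: {opus_rate:.0f} compactions; 200K window: {sonnet_rate:.0f}; ratio {ratio:.0f}x")
```

If the smaller model also burns tokens faster per turn, as it seemed to, the real ratio would be higher than this lower bound.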

Overall, compaction lobotomies are part of why humans are still needed. The human's context window seemingly extends forever, even in the context of the project (never mind the insane amount of context available to every other situation in the human's life). The human maintains context across compactions, across chats, across multiple agents, across multiple phases of the project. The context is huge and continuously updated: it never goes stale. Call it the Bonkers Massive Human Context Window.

Fortunately, AI makes it "easy" to fix the problems that surface later.

This is true. Once you discover the problem, the AI is eager to help fix it. But it does leave a trust gap about what's going on overall.

It becomes frustrating because you realize it is a black box. You have a sophisticated conversation and brainstorm session, turn it into a plan of action, then the agents go off and do things, and come back claiming it is done. Something was done for sure. But you don't always know what. You CAN know. For example, you can look at the "git diff". But it just added hundreds or thousands of lines of code across several files. And at some point you kind of just have to say, "YOLO" or "Geronimo" or "Here goes" depending on what generation you come from.
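One way to make a huge diff less of a black box is to triage it before reading it. Below is a minimal sketch, assuming you are inside a git repo: it parses the output of `git diff --numstat` (a real git option that prints added lines, deleted lines, and path per file) and lists the files with the most churn first. The sample text and file names are made up so the sketch runs anywhere, and the helper names are my own.

```python
import subprocess
from typing import List, Tuple

Row = Tuple[int, int, str]


def parse_numstat(text: str) -> List[Row]:
    """Parse `git diff --numstat` output into (added, deleted, path) rows."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue
        added, deleted, path = parts
        # Binary files are reported as "-" for both counts; count them as 0.
        rows.append((int(added) if added.isdigit() else 0,
                     int(deleted) if deleted.isdigit() else 0,
                     path))
    return rows


def largest_changes(rows: List[Row], top: int = 5) -> List[Row]:
    """Return the files with the most touched lines, biggest first."""
    return sorted(rows, key=lambda r: r[0] + r[1], reverse=True)[:top]


def numstat_since(base: str = "HEAD~1") -> str:
    """Run git in the current repo; `base` is the commit to compare against."""
    result = subprocess.run(["git", "diff", "--numstat", base],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Made-up numstat text with hypothetical file names, for illustration only.
SAMPLE = "120\t4\tsrc/model.py\n3\t1\tREADME.md\n-\t-\tdocs/figure.png\n"
for added, deleted, path in largest_changes(parse_numstat(SAMPLE)):
    print(f"+{added:<5} -{deleted:<5} {path}")
```

In a real session you would pass `numstat_since()` to `parse_numstat` instead of the sample. It does not tell you whether the code is right, only where to look first.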

What has become my more reliable way of knowing something is done is looking at the results. I can usually tell if something was done or not, or done well or poorly, by looking at outputs. Here's the thing - there are almost always surprises. Often there are pleasant surprises. The agent added clever things you did not discuss. Other times though there are unpleasant surprises: the agent clearly lacked an understanding of something fundamental about what you wanted, and sort of just made a bad guess instead of clearing it up first. It is good to assert rules about this, and to question the AI deeply about assumptions it is making, and about decisions it is making without asking. It can slow down the speed at which code is produced, but it can help you make sure the AI has a 1:1 understanding with you.

I wish I had these tools in grad school. They rock.

It may sound like I am saying coding with AI is not amazing. It is. Even with any current weakness I might describe, it’s still amazing and there’s no going back. But they need a human. And a human who is a domain expert. I’m working on something I know very well. And I know the data very well. So it’s easier for me to call out "bull shit" to use a highly technical term that few people understand (sorry for being pedantic). And I have to call it out all the time. It begins to feel like trying to get kids to do chores or eat their dinner. The agents try to basically push food around on the plate and sweep the toys under the bed. It can also feel like herding cats, albeit very sophisticated cats that can have a truly marvelous conversation with you. But I suppose "herding cats" is the thesis I have been developing: steering, or shepherding, is important. At some point "shepherding" may be solved, and these AI tools might work right out of the box, perfectly shepherded: but we are not there yet.

Alright, humans are needed. Got it. But are all humans needed?

That is, are the same number of humans needed for "coding" that were 5-10 years ago? Or another question: are the "same" humans needed? Sadly, the productivity increase that AI coding produces may justify a smaller workforce, at least theoretically. But it needs to be the right smaller workforce: the wrong people could lead to trouble for sure. In contrast, coding was just a means to create something, and as I've said in a previous post: creation is not dead. There might be an explosion of jobs for "ideas" people.

Nonetheless, to ensure code quality, there is still a need for heavy interaction between the AI and humans. To prevent dropped work, deferral, scope narrowing, poor decision making by agents, either a system needs to be set up where almost every decision is surfaced to a human or a human will be needed in the workflow to continuously detect these issues and re-route back. There is still a need for humans to spend real time thinking, auditing code and ideas, and making those decisions.

So yes, despite how amazing agentic AI currently is compared to anything we've seen, when the hype cycle starts to normalize to a more realistic zone, in place of "AI-assisted *", it would be interesting to start seeing new buzz terms like "human-intelligence informed agentic workflows", "human-assisted AI", “human-intelligence integrated *”, and other phrases that mean "we still need humans".

That is unless the AI just moves quickly beyond its current limitations. Then I am sorry for the false hope!

---

The observations in this post were made between late March and early May. Even if some of it is already outdated, some of the wisdom learned I suspect will stay relevant.

This blog post was entirely written by me. Not AI at all. But the cartoons and augmented pictures were constructed by ChatGPT following the instructions in my highly nuanced, sophisticated, comical prompts. So we both kind of made them, right?!

---

May 27, 2026 update: a snapshot of this blog post is now on LinkedIn.

Friday, May 1, 2026

On GPT-5.4 in Copilot and Codex: a steering layer that makes a difference

by JohnUrbanGenome

..Status: working draft / living document / corrections welcome

---

GPT models are what power ChatGPT as well as agentic coding in Codex and other platforms. Copilot is essentially another chatbot inside VSCode that deploys a wide range of AI models, including GPT models as well as Claude, Gemini, and other models. Copilot is essentially a layer over those models, and I believe it is an important layer that helps steer, or shepherd, the models toward being better coding collaborators. This was especially true in my experience with GPT-5.4 in Copilot vs Codex. I observed that Codex was like the "wild wild West" version of Copilot even when both are using the same reasoning level. It was not as refined. It did not always execute everything in my prompt without further prompting.

I did not fully understand at the time how this could be possible. I thought I was going crazy so I talked to ChatGPT about it. ChatGPT confirmed that the difference is very real because Copilot uses GPT-5.4 with its own set of heavy instructions on how it is supposed to behave — basically like a super eager, capable, amazing worker who does what you ask and wants a pat on the head and follows all the rules. ChatGPT explained that Codex is not geared up to be like that - so it ends up feeling like working with someone who is super smart but lazy, someone who tries to get away with cutting corners and doing the bare minimum, and who doesn’t really read the instructions, and such.

In my experience between late March and late April 2026, there were marked behavioral differences between Copilot and Codex with GPT-5.4. Codex with GPT-5.4 was great, but did not perform as well as Claude Code (Sonnet). In contrast, Copilot with GPT-5.4 for sure rivaled Claude Code (Sonnet) in at least a few ways, one being it basically never forgot to do anything it was supposed to do. I could basically always trust Copilot with GPT-5.4. With a Medium reasoning level it could quickly and accurately implement a spec made by Claude (Sonnet or Opus). With higher reasoning levels it could be used for anything, but again: this was mainly true for the Copilot context. Rivaling or exceeding Claude was particularly a possibility when using the highest reasoning modes with GPT-5.4 in Copilot, but what likely pushed it into a winning position was that this setup was much cheaper than using Claude Code. In general, it feels like your money goes farther with Copilot, but more so with GPT models than Claude models. In contrast to GPT models, my opinion around this time was that working with Claude in the Claude Code extension was better than using Copilot with Claude models. None of this is gospel though - I was building these opinions with limited exposure. But the lesson I began to learn around this time was shaped something like:

Claude for Claude, Copilot for GPT.

Importantly, this observation about Copilot vs Codex is almost certainly already "stale".

The GPT-5.5 model came out on April 23rd. I found that Codex with GPT-5.5 substantially closed the gap. It was at least as good as Copilot with GPT-5.4. Copilot did not yet have GPT-5.5 at the time of this writing, so I am wondering if it will still be able to boost the new GPT model beyond its performance in Codex (according to me and subjective experience). Codex with GPT-5.5 and Extra High reasoning is for sure rivaling Claude Code now, even the Opus level.

The gap is narrowing or does not exist. I kept using Claude Code (Sonnet) as my main workhorse, but these observations started forming the basis for how I chose to use different agents. Claude Opus or Codex with GPT-5.5 as a planner and auditor. Copilot with GPT-5.4 as a major force for faithfully implementing plans worked out by other agents, and even surfacing interesting issues that the other agents missed. In fact, due to costs, Copilot with GPT-5.4 started becoming my main workhorse even though my aim was for that to be Claude. (And, to be thorough, Gemini CLI was used for occasional audits only around this time.)

Claude Code should 100% be in your tool belt. But yes - I think the lines are getting blurrier and blurrier. So it might really come down to what a platform adds to those models to conform them to desired behaviors. "Steering", or "shepherding", models to be better coding collaborators might be what pushes a platform or independent developers over the edge. Same model. Different steering. Better results.

I think that is what the Copilot vs Codex lesson taught me: there is a layer of regulation needed above the models that can increase the usefulness of the model as it already is.

What I am interested in is developing a system of markdown files that helps shepherd the behavior of all agents I work with within my "onionskin" repo. I have been doing this by using their "agent files" (such as CLAUDE.md, AGENTS.md, .github/copilot-instructions.md, and GEMINI.md) to direct them to an entry point to a wider set of "agent conventions" across the repo. I will discuss this further at length in a future post.
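A small sanity check for this kind of setup could look like the sketch below. It only verifies that each agent file exists and mentions a shared entry point. The entry-point path is hypothetical (this post does not name the real file), and the function name is my own.

```python
from pathlib import Path
from typing import Dict

# Agent files named in this post.
AGENT_FILES = ["CLAUDE.md", "AGENTS.md", ".github/copilot-instructions.md", "GEMINI.md"]

# Hypothetical entry point; the real file name is not given in the post.
ENTRY_POINT = "docs/agent_conventions/README.md"


def check_agent_files(repo_root: Path, entry_point: str = ENTRY_POINT) -> Dict[str, str]:
    """Report, per agent file, whether it exists and mentions the shared entry point."""
    report = {}
    for name in AGENT_FILES:
        path = repo_root / name
        if not path.is_file():
            report[name] = "missing"
        elif entry_point in path.read_text(encoding="utf-8"):
            report[name] = "ok"
        else:
            report[name] = "no pointer to entry point"
    return report


if __name__ == "__main__":
    # Check the repo in the current working directory.
    for name, status in check_agent_files(Path(".")).items():
        print(f"{name}: {status}")
```

This checks wiring only. Whether an agent actually follows what the entry point says is a separate problem, and the harder one.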

Prompt to ChatGPT: Can you make a biblical style picture of a shepherd shepherding a bunch of sheep with AI model names? The shepherd should be meaningfully leading them in a good direction toward something desirable. It should have two frames. The second frame should be a lazy shepherd just sitting on a rock, doom scrolling on his phone as the sheep are just doing whatever they want.
<<The Bible verse may be inaccurate. I did not pick the AI names for each sheep!>>

future looking

1. I want to retest how well Copilot does with the Claude models. I have been strictly using GPT models because I long ago (<2 months) came to the opinion that Copilot was not as good with Claude models as Claude Code, and so just used GPT models to have something “orthogonal” to my Claude Code sessions.

2. I have not been using Cursor at all, and there is a chance Cursor AI is better than Copilot AI for conforming these models and w/e it does to improve their coding usefulness.

3. Antigravity from Google looks very interesting. If it holds up to what I think it might be able to do, then it may be the highest ROI if one learns how to be the orchestrator above the orchestrator AI. I go through a lot of Brainstorm / Spec / Audit / Implementation cycles across 2-4 agents. With Antigravity, from what I understand, I would have to probably map out those cycles and contracts before presenting it to the orchestrator AI, who then takes on the role I have been using as the Orchestrator of multiple agents.

---

The observations in this post were made between April 16-29 (if not earlier), and are likely to already be outdated. However, some of the wisdom learned I suspect will stay relevant.

This blog post was entirely written by me. Not AI at all. Except if you count ChatGPT for making the cartoons and augmented pictures. Then ok - I had help!

Thursday, April 30, 2026

Gemini: reactions to integrating Gemini into a multi-agent development system for genomics software

by JohnUrbanGenome

..Status: working draft / living document / corrections welcome

---

Gemini has been integrated across most Google products, and I like a lot of it. I love talking to Gemini Live inside Google Maps when traveling, asking about the route as well as talking about anything I want. In general, I've used "live mode" in the Gemini App more than in ChatGPT or Claude Apps. Gemini also lives inside my Gmail, Google Drive, and Chrome Browser. So it is becoming omnipresent. I've been figuring out more ways to use it, and it is mostly fantastic for those applications.

Gemini is also a strong chatbot for discussing bioinformatics. Prior to using the Gemini CLI for agentic coding, I would use the chatbot as a second or third "voice" for reasoning out various features to add to "onionskin", a program I began developing with ChatGPT followed by Claude Code and other agentic AI, touched on previously here and here. Since it was the AI that I talked with most in Live mode, I began having live conversations with Gemini on my car rides. Gemini even taught me about Claude Code, and how to do things like "slash commands" - the most useful one I learned being "/remote-control".

My fondest memory of Gemini Live conversations was discussing a recent feature I added to "onionskin" with Claude Code (perhaps ChatGPT advised on it as well). The new feature involved scoring genomic coverage profiles corresponding to candidate re-replication domains according to shapes like rectangles or triangles. The purpose was to classify those candidates as either collapsed repeats (rectangles) or true re-replication domains (triangles). It involved computing shape scores, then using a Bayesian Information Criterion (BIC) approach to see whether triangle modeling performed substantially better than rectangle modeling. In a single car ride, Gemini helped me work out exactly how the shapes were being modeled, scored, and compared -- without either of us ever looking at the code. The next chance I got, I found all the pertinent code - and there it was exactly as we had surmised. So Gemini Live was fantastic for conversation, and fantastic for tossing around ideas.
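To make the rectangle-vs-triangle idea concrete, here is a toy sketch of a BIC comparison. It is not the onionskin code. It assumes evenly spaced coverage values across a candidate domain, a flat one-parameter "rectangle" model, a two-parameter "triangle" model (baseline plus a symmetric tent peaking mid-domain), Gaussian errors, and a BIC margin of 10 as the cutoff for "substantially better". All of those choices, and the toy profiles, are assumptions for illustration.

```python
import math
from typing import List, Sequence


def rss(y: Sequence[float], fitted: Sequence[float]) -> float:
    """Residual sum of squares between observed and fitted values."""
    return sum((a - b) ** 2 for a, b in zip(y, fitted))


def bic(rss_value: float, n: int, k: int) -> float:
    """BIC for a least-squares fit with Gaussian errors, up to an additive constant."""
    return n * math.log(max(rss_value, 1e-12) / n) + k * math.log(n)


def fit_rectangle(y: Sequence[float]) -> List[float]:
    """Flat profile: one parameter (the mean height)."""
    mean = sum(y) / len(y)
    return [mean] * len(y)


def fit_triangle(y: Sequence[float]) -> List[float]:
    """Baseline plus a symmetric tent peaking mid-domain: two parameters."""
    n = len(y)
    if n < 3:
        raise ValueError("need at least 3 points to fit a triangle")
    tent = [1 - abs(2 * i / (n - 1) - 1) for i in range(n)]
    t_mean, y_mean = sum(tent) / n, sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in tent)
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in zip(tent, y))
    slope = sxy / sxx                      # ordinary least squares on the tent shape
    intercept = y_mean - slope * t_mean
    return [intercept + slope * t for t in tent]


def classify(y: Sequence[float], min_delta: float = 10.0) -> str:
    """Call 'triangle' only if its BIC beats the rectangle's by more than min_delta."""
    n = len(y)
    bic_rect = bic(rss(y, fit_rectangle(y)), n, 1)
    bic_tri = bic(rss(y, fit_triangle(y)), n, 2)
    return "triangle" if bic_rect - bic_tri > min_delta else "rectangle"


# Toy coverage profiles, made up for illustration.
peaked = [1.1, 2.0, 2.9, 4.1, 5.0, 3.9, 3.1, 2.0, 0.9]
flat = [5.0, 5.2, 4.9, 5.1, 5.0, 4.8, 5.1, 5.0, 4.9]
print(classify(peaked), classify(flat))
```

On these toy inputs the peaked profile is labeled a triangle and the flat one a rectangle. The extra `k * log(n)` penalty is what stops the two-parameter model from winning just because it has more freedom.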

Overall, Gemini is a strong chatbot for discussing bioinformatics. That is why I was surprised at the relatively poor performance of Gemini CLI for agentic coding in early April 2026 (underlined because this opinion is probably already "stale").

Due to its ubiquitous integration into the Google ecosystem I've been using for years, I have often posited that if Google "gets it right" with Gemini, I could see a world where it is the only AI I need. But so far, that world does not exist yet. I find tremendous value in using other AIs.

Let's be more specific. I was working with Gemini three ways during April:

1. Gemini Code Assist extension in VSCode

2. Gemini CLI agentic coding in VSCode

3. Gemini CLI in Terminal

Moreover, the model I was using almost exclusively was "Gemini 3.1 Pro Preview" - but also other models available around this time.

I wrote the following to my brother on April 16th, 2026:

Gemini CLI in VSCode is barely usable for agentic coding. It can take like 30 minutes to answer a question like “How are you?”. It just hangs forever - probably because it is building or reading its massive context window. But that is problematic too - it takes a snapshot to memory and then never checks again basically without forcing it to at gun point. So if you are toggling between agents and doing active development, it quickly becomes “stale” and far “adrift” from the current reality of the code base. That was tolerable b/c there are ways around it - but the chronic slowness is insane.

Let's break that down. There were some important bits that bear repeating.

1. Ultra slow responses - "It can take like 30 minutes to answer a question like “How are you?”. It just hangs forever - probably because it is building or reading its massive context window."

2. Large context fails if it is quickly stale - "it takes a snapshot to memory and then never checks again basically without forcing it to at gun point. So if you are toggling between agents and doing active development, it quickly becomes “stale” and far “adrift” from the current reality of the code base."

Fortunately, I found that the ultra slow response problem was solved if I used Gemini CLI in the VSCode Terminal (not the VSCode extension) or just the regular Terminal. I later said to my brother on the same day:

Update on Gemini - using Gemini CLI from the command-line is a whole different story than the Agent in VSCode (not through copilot, the regular Gemini interface in VSCode). It was fast and responsive, and more enjoyable. I only have N=1 time using it, and I can’t be sure the VSCode agent wouldn’t have also been flying. But this was a different class of experience than I had been having. Btw - I am using Gemini CLI in Terminal, but in the VSCode Terminal, so it still interacts with VSCode just fine - including showing diffs and all that.

I then asked Gemini CLI directly about the performance difference between the extension and Terminal:

White background = Gemini
Grey background = Me

Unfortunately, the issues with a giant stale context window are present across Gemini Code Assist (GCA for Q&A) as well as Gemini CLI (GCLI) in all its forms. GCA and GCLI would answer questions in a way that would have been accurate in a prior state of the codebase, but was now outdated. This meant it could not be used reliably in the multi-agent architecture I was constructing during April 2026. Moreover, it tended to be bad at coding in the complex repo. I said to my brother on April 21, 2026:

It is crazy how bad even the latest Gemini agent can be at coding in a complex codebase or maybe at all. It is analogous to a chicken kicking all the chess pieces over and thinking it is winning the game. I just had to do a git revert.

In that story, when my next 5 hour Claude session started, I had to ask Claude how badly Gemini screwed up the repo. Claude came back seemingly "flustered" after investigating the git history with a report essentially condemning the work done by Gemini. Claude then helped with the "git revert" followed by addressing my original needs.

Gemini was sometimes amazing at auditing coding done by other agents, and sometimes terrible. For amazing results, the context window almost certainly had to be fresh and current with the repo. Then it seemed to have the ability to pick up on things Claude, Codex, and Copilot did not or could not pick up on. It never gave very extensive audit reports, but it would add value. Thus, the "fresh eyes" concept of using multiple agents was validated. But then other times, perhaps when the context window was stale but not necessarily, it would give very shallow and vapid reports compared to other agents.

So will Gemini make it as a bioinformatician?

Obviously, yes - it is only a matter of time. I do not think Google will sit down and give up. Nonetheless, as recently as April 29, I was still making notes on Gemini that it was not up to the task for agentic coding.

What Gemini taught me is that a "giant context window" alone guarantees nothing. Not even good context. There were agents with context windows 5x smaller running circles around Gemini. And agents like Claude did not seem to trust their context windows anyway - they tried to find the pertinent files and code to read directly and answer honestly.

Giant context windows, nevertheless, are likely better than smaller context windows given some set of conditions are met. That will certainly include continuously updating the context -- adding new and pruning old in an intelligent way. This is basically what the human brain is already very good at. Human brains have massive continuously updated context windows.
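The "continuously updating" idea can be made a little more concrete with a small sketch. To be clear, this is not how Gemini or any other agent manages its context. It is a hypothetical illustration in Python (the names `snapshot` and `diff_snapshots` are made up for this post) of how one could detect that a remembered view of a repo has drifted from the files on disk, so that new content can be added and old content pruned.

```python
import hashlib
from pathlib import Path


def snapshot(repo_root):
    """Map each file's path (relative to repo_root) to a SHA-256 of its bytes.

    Assumption: every file under repo_root is hashed. A real tool would
    likely skip things like .git/ and large binary files.
    """
    root = Path(repo_root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_snapshots(old, new):
    """Compare two snapshots and report what a stale reader must update.

    added   -> files to read for the first time
    removed -> files to prune from memory
    changed -> files to re-read because their contents differ
    """
    old_paths, new_paths = set(old), set(new)
    return {
        "added": sorted(new_paths - old_paths),
        "removed": sorted(old_paths - new_paths),
        "changed": sorted(p for p in old_paths & new_paths if old[p] != new[p]),
    }
```

The intended use would be to take one snapshot when an agent first reads the repo, take another before it answers a question, and treat any non-empty list from `diff_snapshots` as a signal that the remembered view is stale.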

Gemini is already great at discussing bioinformatics and genomics as a chatbot and Live conversation companion. It just needs to catch up in the agentic space. Even now, it is great for one-off scripts, tab completion, code review (with a fresh context window, or in chat), and so on. I would predict that in time, Gemini CLI will catch up, and it is not impossible that it could one day lead the pack. What Google has going for it is a massive user base ready to adopt it. Their emphasis has probably been on integrating Gemini across their already expansive ecosystem (Gmail, GDrive, Maps, Search, etc). It is just a matter of time before Gemini CLI proves to be as useful as other agentic AI already is, and when that happens it can easily be widely adopted and integrated.

But at the time of writing this: Gemini CLI is not playing as well as other agents inside a multi-agent development architecture, likely due to its giant context window quickly becoming stale when other agents do work, even when forcing it to read handoff files specifying the new work.

---

This blog post was entirely written by me. Not AI at all. However, ChatGPT was used to make the cartoons and augmented pictures. I had the ideas though ... so we both get credit, right?!

Tuesday, March 31, 2026

Claude the bioinformatician: reactions from my first pass at using Claude Code on real genomics software and data

by JohnUrbanGenome

..Status: working draft / living document / corrections welcome

---

I recently began eagerly exploring agentic AI, and wrote about it here. That is when I was a total newb more than several days ago! Back in those days long past, I used a tiny toy code base and embarrassingly simple prompts. These days I am working with Claude Code and other agentic AI in an actual codebase I was working on called "Onionskin". I also worked with Copilot, Codex, and Gemini, but I worked first and most with Claude. This blog tells that story - my first reactions.

Sunday Mar 22 - Onionskin moves from ChatGPT to Agentic

Onionskin is a complicated program I originally prototyped with ChatGPT. I had ChatGPT make extensive "handoff instructions" and agent instructions. Then I asked for it to give me what my first prompt to Claude Code should be in the repo, which would include reading the handoff and agent instructions. Then I brought Claude Code into the prototype repo, and "we" just hit the floor running. The experience was very similar to iterating with ChatGPT but far smoother since it is all "in place". Less drift. Less frustration. It is simultaneously amazing how much you can accomplish as well as overwhelming. What I've made is 99% "vibe coded" (i.e. coded by AI) by which I mean 100%: I am inspecting stuff and making sure things are right... but writing very little. My main purpose is just human intervention. I'm a code reviewer, logic reviewer, idea reviewer.. but also a major contributor to the ideas. I think my domain knowledge and analytical knowledge is still essential to help guide development, and to interpret what has been developed.

A huge part of my job on this project is now review, not coding, but I am also having agents review. This seems especially helpful when you use completely different agents, putting me at a layer above even review - something like an editor or orchestrator. So even code review is just human-guided, not necessarily human-performed.

Agentic coding can be overwhelming because you can create a massive complex program in a day, with 1000s of lines of code, several different pipeline choices and pathways, inputs and outputs, and parameters, and options... and so on. And since you didn't develop it over the course of weeks and months, you don't have that same feel for everything... yet you have to review it anyway. So it is like reviewing someone else's code. And honestly, when presenting it, it is like presenting someone else's work. I really should just ask ChatGPT and Claude if they would rather explain "my" program in my next lab meeting, and then just silently fade into infinity.

---

Mon, Mar 23 - Big Oops on Token Usage:

I was accidentally having Claude Code be super token heavy, keeping the entire repo and instructions and convo in its context window basically… and having it do rereads constantly and using the most super charged model (Opus).

And it was amazing!

But as the repo got bigger and as expectations increased on what it should do after every edit (smoke tests, regression tests, audits, etc)… all of a sudden I was using my 5 hour limits in 5 minutes. I paid for "Extra Usage" a few times and just wiped it out instantly. So I asked both ChatGPT and Claude Code how to reduce token usage, and ultimately came up with a plan with Claude Code.

It involved a lot of stuff - but the take home is now it seems like the IQ of my assistant has dropped precipitously. How I had it set up - it was the absolute master expert at the codebase and all the ideas and goals and aims and larger picture - and how it all fits together; and each addition to the code was phenomenal.. and so on. Now it’s kind of like talking to someone you had a long relationship with but who then suffered some dementia or brain injury.. and knows a lot less about your history together or what the code is doing.

I say all that to say this:

- Companies who are able to afford having their employees basically use Opus constantly and set up their session like mine was … they will likely be able to make rockstar code in leaps and bounds.

- Companies who cheap out and use lesser models and session designs that minimize token usage… they will run into many more errors and slower development overall.

---

Tuesday, Mar 31 - Just put my name in the author list by the way.

Having agents review each other's recommendations is the way to go. Me to Claude: ChatGPT recommended this. Claude: Well that is good except for all these weaknesses. ChatGPT: Good points, but also this, and not that. Claude: Great even stronger, but we should consider xyz. ChatGPT: Claude is right, xyz should make it stronger. I think the plan is ready. Claude: Me too. Let's go. Me: Awesome. Just put my name in the author list by the way.

---

wrapping this up - will Claude make it as a bioinformatician?

I recognize I titled this, "Claude the bioinformatician: reactions from my first pass at using Claude Code on real genomics software and data" but did not directly address it. Suffice to say, my reactions apply to creating genomics software and working with real genomics datasets. Claude Code allowed me to quickly develop a complex program, but I struggled with fully trusting what was being made because now the rate of productivity far exceeds the rate of human expert guided quality control. It led me to providing "ground truth examples", enforcing copious amounts of regression tests, having extended discussions on what the code was doing, and having the agents walk through the code to translate it into English. This led to a token usage crisis, which I am still battling - and for which I am still hunting for the right balance. Part of that was bringing in other agentic AI platforms including Copilot, Codex, and Gemini. This allowed me to start asking agents to review the work of other agents, thereby distributing my "token usage" across platforms with the benefit of "fresh eyes" and a larger team. Ultimately, as scientists begin using agentic AI in the life sciences, we will need solutions to strike the right balance of productivity, cost (token usage), quality control, and overall accuracy and reliability of the code and results it produces. The latter is something that perhaps sets science apart from more "creative"-oriented applications of AI (not that science is not creative). Creative results are not useful if they do not reflect the nature of the reality being probed. Overall, Claude and other AI agents have a bright future in bioinformatics. In part, it makes everyone a bioinformatician -- but that is exactly why we need to pause and think about how to enforce quality over quantity, and strike the right balances.
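For readers who have not set up "ground truth examples" before, here is a minimal sketch of the pattern in Python. Everything in it is hypothetical: `gc_content` is a stand-in for whatever function an agent wrote (it is not from Onionskin), and the cases are tiny hand-checked inputs where the right answer is obvious to a human. The point is only that the expected values come from the human, not from the agent.

```python
import unittest


def gc_content(seq):
    """Hypothetical function under test: fraction of G/C bases in a DNA string."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


# Hand-checked (input, expected) pairs. These are the "ground truth":
# small enough that a person can verify each answer by eye.
GROUND_TRUTH = [
    ("GGCC", 1.0),
    ("ATAT", 0.0),
    ("ATGC", 0.5),
    ("atgc", 0.5),  # lowercase input should give the same answer
]


class TestGroundTruth(unittest.TestCase):
    def test_known_answers(self):
        # subTest reports each failing case separately instead of stopping
        # at the first one.
        for seq, expected in GROUND_TRUTH:
            with self.subTest(seq=seq):
                self.assertAlmostEqual(gc_content(seq), expected)

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            gc_content("")
```

Saved as a file, this would run with `python -m unittest`. Re-running it after every agent edit is what turns the ground truth cases into regression tests.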

---

future looking:

I am almost done creating a comprehensive multi-agent behavior, memory, and development infrastructure to allow hopefully seamless passing between Claude, Gemini, Codex, and Copilot agents.

I will discuss this more in future posts.

---

This blog post was entirely written by me. Not AI at all - except for the cartoons and augmented pictures, which I explained to ChatGPT for creation... so we are both the illustrators, right?!

---

Late April 2026 Updates:

Over the course of the following month, I worked more with the Claude Code extension in VSCode in the "onionskin" repo, and I found the following issues of concern that I raised on Github.

April 25, 2026 update: see Claude Code github issue, "[BUG] In VSCode extension, is model switching via /model isolated per chat session? Seems like it might not be. #53246"

April 30, 2026 update: see Claude Code github issue, "[BUG] A user can do absolutely no coding and still use up all tokens in a session. Big fail. #55046"

Friday, March 20, 2026

A Newb's Exploration of Agentic AI

by JohnUrbanGenome

..Status: working draft / living document / corrections welcome

Earlier this year, I was creating a bioinformatics program called "onionskin" for a month or so with ChatGPT. But development with the chatbot approach had clearly met its limit. The repo was getting too big. I had to begin setting rules for ChatGPT on all the tests it would need to run to ensure it was at least (1) giving me something that worked, and (2) returning the complete updated repo. But as the codebase became bigger and more complex, it began tripping up more and more. It was time to move on to bringing the AI into a local copy of the repo, not ping-ponging it back and forth in the cloud.

Problem: I had not really used agentic coding yet. Or I thought I had not. I messed around here and there in VSCode and on Github, but I was totally naive.

I asked my brother, "And btw dude -- how do you use Claude for its famous coding stuff? Like all I see are how people with no programming skills told Claude to go build them an App, and it comes back with that App."

The same day I would go on to download all possible AI apps and extensions, and begin learning.

I later texted him, "Just spent a ton of time... but feel like I leveled up a bit. I now have Cursor, Codex, and Claude Code working. I also have the Claude Code extensions in VSCode and Cursor. I have the ChatGPT and Claude Desktop Apps, and the Claude Desktop App also has a GUI for Claude Code (and Claude Cowork)."

I then tested a bunch of agentic AI platforms with a very basic set of prompts - embarrassingly simple really. And I began documenting my reactions. This blog post is simply to expose some of my thoughts from March 19-20, 2026.

REACTIONS:

1. The number of tools can seem intimidating, complicated further by the number of ways to use them - but fear not: it turns out to be somewhat easy to get up and running.

I wrote, "All these tools are mind numbing to an extent because there is some redundancy and I am not sure what my tool stack should be yet."

I was beginning to use agentic AI, but still grounded in the "older" method of chatting with an AI chatbot.

I began asking questions like:
- Will I use Cursor or stick with VSCode?
- Claude Code or Codex?
- Claude Code in Terminal, in VSCode, or in the App?
- If I use Claude Code, do I need Cursor?

I was wondering exactly what Claude in Terminal offers that it does not in VSCode. Chats with Claude and ChatGPT insisted Terminal was better, but for my purposes, those differences were barely perceptible.

Over a short period of time, I found that some of the choices are relatively arbitrary: just pick some preferences, and stick with them for a while. Mix something new in from time to time to see if it sticks.

2. "coding is dead" but with some pushback

I wrote, "I can really see why there are constantly articles about how coding is dead. I do not feel afraid per se though -- b/c coding is dead, but creation is not and creativity and productivity are still needed."

That bears repeating. Coding is dead, but creation is not.

Coding might be dead in the old sense. But coding was only ever a means to an end. It was to create something. There still needs to be a visionary that can dictate the vision and interpret the results through that lens. And coding is not dead. It is just different now. Easier now. Python was easier to code in than some other languages because it was sort of like writing in English. Coding with AI is exactly like writing in English.

AI is a boon to people who are full of ideas, but are only alright at coding. For them, AI will be a means for testing out bigger ideas, and more ideas, faster. AI in both chatbot and agentic form is like having a team of teachers, and students, and workers, and so on. So it may ultimately be good for people that have many good ideas, who are able to dictate those ideas clearly, and evaluate their implementation effectively.

3. Having a coding background is still beneficial

After using some agentic coding, I wrote, "Having a coding background still seems like it is beneficial at this time with these tools."

I noticed that AI companies are moving towards completely abstracting away the coding aspect so anyone can create anything, the same way anyone can tell AI to make a picture and never need to know how the picture was made. If AI were perfect at interpreting human intentions and coding, then the code may never need to be seen by anyone. But we are not totally there yet, and working with these tools and the code they create still requires or benefits from prior experience in the old world. That is not to say that this old-world advantage will last forever, but it is still an advantage.

4. There is no going back

I remember AI started doing tab-completion. That was a major boon to my coding. I really liked that era actually. Once I used it, there was no going back. But that era is already basically over. Agentic AI replaced it for the most part. And there is no turning back. There is just learning how to make agentic AI work for you.

5. Cursor keeps coming up recommended, but does it truly have a moat around it that won't soon be crossed, if not already?

I talked to ChatGPT, Gemini, and Claude about how to get up and running with agentic coding. All recommended "Cursor".

Yet, I was struggling to see why Cursor was considered definitely better than VSCode.

I quizzed Claude on it. Claude highlighted 4 main advantages of Cursor. I pointed out that two of them were certainly not unique to Cursor, please look online and come back. It came back chastising itself a little bit, and gave 4 more reasons why Cursor is better. I pushed back again. Then Claude admitted the gap between them is closing. Still, Claude insisted it still has some advantages because of something about how the AI is a fundamental part of its architecture, not just extensions. Nonetheless, I walked away thinking Cursor had the reputation it had because it was an early success with agentic AI, but that it being strictly advantageous was potentially becoming outdated. Having said that, I have minimal experience with Cursor and would be happy to learn I am wrong. I just need use cases that prove its superiority.

After testing both several ways, I wrote:

"""
All experiences are extremely similar from a functional POV for a small python project. Honestly, VSCode with CoPilot seems to be analogous to the advantages Cursor offers. I believe the gap is very much shrinking, and will continue to do so.

Cursor also integrates with the Codex and Claude Code extensions, and using them within Cursor is exactly the same as using them within VSCode. So it is irrelevant whether you use VSCode or Cursor when using those extensions. The difference is just the native chat interface and the Cursor AI integration with using the other models, BUT the CoPilot chat interface looks and feels almost exactly the same, and differences may not be noticed by many users (that is my assumption). Use either IDE - I don't think it will matter much, especially if you're using Codex and/or Claude Code extensions. I believe Cursor probably came out swinging last year with features VSCode did not have, but that gap has closed massively. I retain the right to be wrong here though!
"""

6. AI Apps vs IDEs - use one, the other, or both? Does it matter?

The Codex and Claude Code Apps were weirder experiences if you're used to VSCode. It felt more like developing a prototype with ChatGPT than coding in an IDE. Nonetheless, it is doing the same stuff as the extensions in VSCode.

Claude and ChatGPT insisted there are some advantages to using the Apps over the extensions, but I have not yet got to that use case. It would be perfectly reasonable, though, to work with Codex or Claude Code in the App and have VSCode alongside it to monitor the directory and contents and changes, but that is a little more wonky than just having Codex and VSCode in the same place.

Apparently some say the whole concept of IDEs is outdated now that AI does all the coding. The claim seems to be that we don't even need to see what is happening; just let it all be a black box on some level.

But I think that only describes the "vibe coding" market: people who want a very low barrier to making a program, where seeing it all happening might upset them.

At the moment, it seems like developing code for scientific discovery still would need humans to verify it does what the AI says it does even if you trust the AI. After all, it is not the AI putting its career on the line. And after all, scientists need to know what they are asserting. Someone somewhere needs to know!

End of the day conclusions:

At the end of the day - I'd say the simplest thing for me to do is just use VSCode and the extensions. Otherwise, I can continue exploring Cursor and the extensions there. I remain curious about any real advantages to Cursor over VSCode+CoPilot as well as to the Apps over the extensions.

---

Early testing and conclusions:

The above were all some of my initial reactions.

I tested the following that night:

- VSCode chat box using Claude Sonnet 4.6

- VSCode Codex extension

- VSCode Claude Code extension

- Cursor chat box using Auto

- Codex App on Mac OS

- Claude App on Mac OS

I used these example prompts for testing:

```
- Spawn a subagent to explore this repo.
- Explore this repo.
- Are you able to take commands directly as well as spawn subagents for given commands?
- Create test.py with the following code:
   def hello():
       print("hello world")
   hello()
- turn this repo into a real small Python project, and review test.py to suggest improvements
- Did you review test.py to suggest improvements?
- Can you make this script more robust and add logging?
- Would it be worthwhile creating a toml file and subdirectory structure and contents typical of python programs for this project to make it production ready? If so, please implement.
- Can you run those tests? And fix any bugs that are detected?
- Add logging and make this more robust
- Make it more robust, design tests for it
- If any of the following are needed, please do them: add logging, make it more robust, design tests for it, add docstrings
- Create utils.py with the following function:
   def multiply(a, b):
       return a * b
- Incorporate the multiply function from utils.py inside hello, maintain robustness, update logging if needed, set utils.py appropriate subdirectory as needed
- Refactor this project so that:
       - hello() is part of a class
       - logging is added
       - utils is properly integrated
       - code is production-quality
       - README.md is up to date.
       - All code has helpful docstrings.
       - There are ways for user to get help message(s) and usage information.
   Ignore tasks that are already done.
- Can you plan one addition to this small python project?
- Can you tell me more about spawning subagents? Can you give me a prompt that I could give you in the future that would be viable for spawning two subagents adding different things to this project? I would like to see how they work in parallel.
- Spawn two subagents to work on this project in parallel.
       Subagent 1:
       Add an environment-variable feature to the hello project so users can set default values for name, times, and log level from the shell. Update the CLI integration, validation, and tests. Keep changes scoped to the application code and tests that cover this behavior.
       Subagent 2:
       Add developer-quality improvements to the project by creating a CONTRIBUTING.md file, expanding README.md with development and testing guidance, and adding a small smoke test that verifies the CLI entry point works as documented. Keep changes scoped to docs and non-overlapping tests.
       After both subagents finish, integrate their work, resolve any conflicts, run verification, and summarize what each subagent changed.
```

---

This blog post was entirely written by me. Not AI at all - except for the cartoon, which I explained to ChatGPT for creation.

Thursday, May 26, 2022

A Newb's Exploration of Protein Structure Prediction and Analysis Web Servers

by JohnUrbanGenome

..this is a living document that I will update as I go....
...I do not guarantee it is finished now or ever will be...
...it is safe to assume I don't know what I'm talking about and am willing to learn...
..this is version 1.13; 2022-05-28
..previous versions 1.0-1.7; 2022-05-26; 1.8-1.12 2022-05-27
... I would like to give a special shout out to my colleague, Karina Gutierrez-Garcia, for helping me to get started with Robetta, AlphaFold, SAVES, and some other software, links, and advice I've still not had time to explore.

A little song and dance about the old days of protein structures:

I confess: I've been a warm body in the audience of many structural biology presentations, but my mind was elsewhere. I get it. I get it. That atom was a tricky one to place. That protein fold was unexpected, sure I guess. And that was something how the protein looked like a ball of yarn before you mutated the alanine at position 369, and still looked like a ball of yarn after.

There were some very good protein structure presentations I've been to as well. Usually about ribosomes. When a structural biologist is also a great presenter (rare in science generally), then it becomes fascinating. Look, it's just a little machine with gears, and ratcheting mechanisms, and scissors! This part here - look how it works like a socket wrench, and this part like a circuit switch. Look how this forms the hardest material in the biological world, and this other part that makes it flexible. See the lattice work!

I've always been glad there were structural biologists crystallizing proteins, doing NMR, cryo-EM, etc; but I was content to not do it myself. It is one of those branches of science that you understand is necessary and foundation-building; that has already yielded big rewards, and will continue to do so slowly but surely; but that seems somewhat sectioned off from what most researchers are doing. The average biologist may have thought, "Sure, it'd be amazing to have the structure to all my favorite proteins, whether to help with designing mutational strategies, to search for 'structural homology', or to just see the little nuts, and bolts, and gears doing that thing it does. But what is the likelihood of ever getting a structure for my favorite proteins?" The likelihood has been practically 0 for a long time, especially if you're not working with a model organism. It wasn't really an option for most people, so one just needed to think about other tools, other aspects of biology, other experiments, other mechanisms, other models, and so on.

...but then predicting the 3D structure of proteins leveled up with deep learning, from programs like AlphaFold. Now it seems there is a whole new world of structure-function questions and applications for any regular, middle-of-the-road, non-structural biologist. It feels like the deep sequencing revolution from 10-15 years ago. There is more and more attention to the predicted structures from computational biologists looking to make programs that do even better, and from molecular and cellular and developmental biologists who can make use of these predictions. These programs are going through all the protein sequence databases, predicting what-appear-to-be accurate or accurate-enough structures for thousands or hundreds of thousands of proteins. It's reminiscent of the open fire-hydrant-like deluge of sequencing data that began pouring out with Solexa/Illumina technology. It seems a clubhouse I was never cool enough to be allowed in, nor were you, has suddenly opened its doors for everyone...

So... I decided to walk in and have a peep around.

This is what I've found so far.

Pretending to know Predicting a protein structure:

First of all, if you're working with a model organism, your protein-of-interest may already have an AlphaFold structure here: https://alphafold.ebi.ac.uk/ . These structures are already hyperlinked across the internet and featured in common databases like UniProt. For example, UniProt features both the experimentally-derived and AlphaFold-predicted structures of the human MCM2 protein: https://www.uniprot.org/uniprot/P49736#structure .

If you are working with a non-model organism, or want to predict the 'de novo' structure of a mutated form of a protein, then you can use AlphaFold or other programs yourself. I'll be honest -- it feels like I'm "supposed to be using AlphaFold", but I haven't been. I found two options when setting out to do this:
- AlphaFold
- Robetta (RoseTTAFold and other programs)

I just found Robetta so easy to use that I've spent most of my time there: go ahead, sign up, copy/paste a protein sequence, and press go. That's all there is to it.

Things to do with the PDB structure files:

Search structure databases for similar protein structures.

My aim was to use the predicted structures to look for ‘structural homology’ in PDB, AF, and other databases for proteins that are lowly conserved at the sequence level. Ideally, one would just be able to use a "structure BLAST" interface at NCBI. I did not find that, but I found several programs to do structural database searches. Here I will use Drosophila MCM2 from AlphaFold as an example. In some I use a Robetta model for MCM2. At the moment, I have reviewed FoldSeek, DALI, FATCAT, and 3D-surfer. Overall, I am bullish on FoldSeek, DALI, and FATCAT; but I am scratching my head a little bit at 3D-surfer.

FoldSeek:

I’ve found that FoldSeek is fast (in "local mode" : 3Di/AA), and seems to be quite good at recovering known true positives for the PDB files I give it (e.g. Human or Drosophila protein structures); in my experience, the top hit was exactly the protein used, and all other top hits were actual homologs in other species. I tried to run "TM-align" (global) mode, but tens of minutes later it was still not done and, after being spoiled on the seconds time-scale for "local mode", I threw a tantrum and closed out the tab.

Search Parameters:

...mere seconds later....

Easily interpretable BLAST-like alignment visualization

Easily interpretable BLAST-like alignment statistics table:

An issue with the results as presented is that it doesn't tell you the molecule names of the hits, just the model name and the species. You have to click on the model names. From above, it may seem like AlphaFold ("AF") models are all in the top hits, not real experimentally-derived structures from PDB... but that is just an illusion. They group the results by database. See below.

Other database alignment visualizations:

PDB alignment statistics:

These are MCM2 structures or larger complex structures with MCM2 in them.

Having shown that though, the top scoring AF hits have scores in the 5000-7000 range, with e-values of 0, whereas the top hits in the PDB database have scores in the 2000-4000 range with e-values < 1e-60. This may reflect a bias of using an "AF" structure as the query, or that experimentally-derived structures rarely include the full-length protein; sometimes with N- or C- terminal bits trimmed off; or disordered parts trimmed off; or only featuring a certain domain; etc.

UPDATE:

After getting Robetta Structures back for Dmel MCM2, I revisited FoldSeek. The "3Di/AA" mode gave very similar (maybe identical) results for the Robetta MCM2 Model_1 to results I saw for the AlphaFold2 MCM2 model -- I recall reading somewhere that 3Di/AA mode takes the protein sequence into account (thus the "AA" part of 3Di/AA), so this may not be surprising.

I also gave TM-align another shot, and yes it took longer, but not so long this time that I abandoned it. Not very long at all really -- on the scale of minutes. As a reminder, FoldSeek can look at AF and PDB databases. For TM-align mode with the Robetta MCM2 structure (Model 1), the AF database yielded many structures predicted for MCM2. The main difference between the TM-align and 3Di/AA results was which organisms the predicted MCM2 structures came from. For TM-align, the top hit was not the predicted Drosophila MCM2, for example: it was from "Dracunculus medinensis". The PDB structures also were all MCM2-containing DNA replication complexes. So I can vouch for both search modes (3Di/AA and TM-align) for FoldSeek now. I will need to read more into the differences between these modes to say any more though.

The statistics tables also differ a little bit between 3Di/AA and TM-align modes. They both share the first 4 columns for a given Database (Target, Scientific Name, Seq. Id., Score) and the last 3 (Query Pos., Target Pos., Alignment). However, (i) the "Score" column represents different scoring strategies and scales, and (ii) where 3Di/AA shows an E-value, TM-align shows a TM-score. The E-value is readily familiar from BLAST searches: smaller (closer to 0) is better. It appears that the TM-score is between 0 and 1: higher is better. It appears that the Score for TM-align is just FLOOR(TM-score*100) -- i.e. the TM-score multiplied by 100 and rounded down to the nearest integer; or more simply, just the first two decimal places of the TM-score read as an integer.

For the Robetta Model1 of Drosophila MCM2, the top 10 hits in the given database had the following ranges:

3Di/AA AFDB:

- seq id 51-100%, and higher ranked hits had higher seq id.

- score 4274-5869

- E-value 5.255e-96 to 0.

3Di/AA PDB:

- seq id 53.4-95.9%, not necessarily correlated with rank

- score 3203-4012

- E-value 1.198e-71 to 3.524e-90

TM-align AFDB:

- seq id 37.2-67.6%, not in same order as rank

- score 76-79

- TM-score 0.7615-0.7905

TM-align PDB:

- seq id 24.5-63.4%, not in same order as rank

- score 60-66

- TM-score 0.6003-0.6688
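The FLOOR(TM-score*100) guess above can be checked against the TM-align ranges just listed. This is only a minimal sketch: the rule itself is my guess, not something confirmed by the FoldSeek documentation, and it assumes the lowest (and highest) Score and TM-score in each range belong to the same hit. The names `reported` and `score_from_tm` are made up for the illustration.

```python
import math

# (lowest, highest) values among the top 10 TM-align hits, copied from the lists above
reported = {
    "AFDB": {"tm_score": (0.7615, 0.7905), "score": (76, 79)},
    "PDB": {"tm_score": (0.6003, 0.6688), "score": (60, 66)},
}

def score_from_tm(tm_score):
    """Guessed rule (an assumption): Score = floor(TM-score * 100)."""
    # The tiny epsilon guards against floating-point products like 28.999999999999996.
    return math.floor(tm_score * 100 + 1e-9)

for db, values in reported.items():
    floored = tuple(score_from_tm(t) for t in values["tm_score"])
    rounded = tuple(round(t * 100) for t in values["tm_score"])
    print(db, "floor:", floored, "round:", rounded, "reported:", values["score"])
```

Rounding down reproduces both reported Score ranges. Ordinary rounding would turn the 0.6688 PDB hit into 67 rather than the reported 66, so these numbers fit "round down" better than "round to nearest".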

Note that the FoldSeek authors consider good TM-scores to be >0.5, and bad ones to be <0.5, although it appears that true positive hits can easily have TM-scores <0.5 (e.g. distant homologs).

UPDATE: I highly recommend reading the FoldSeek preprint as a primer not only on FoldSeek, but the 3d-alignment/database search field generally.

A peek at FoldSeek: In short, exhaustive structural searches in huge databases could take eons, so FoldSeek offers a strategy, similar to how sequence searches use heuristics like k-mers, to drastically pre-filter, reduce the search space, and gain orders of magnitude in speed. This is done by converting the structural information of a protein into a string of letters, as if it were a protein or DNA sequence, then using sequence-based approaches (like BLAST) to pre-filter. Older approaches used "structural alphabets" focused on the AA backbone, whereas FoldSeek created a new structural alphabet focusing on 3D interactions (hence "3Di"). The pre-filtered high-scoring hits are either aligned locally by combining structural and Amino Acid substitution information (3Di/AA) or globally (TM-align). The authors claim that FoldSeek is much more sensitive than previous structural-alphabet-based tools (e.g. 3D-BLAST), less sensitive than DALI, more sensitive than the structural aligner CE, and similar to TMalign and TMalign-fast; all while being up to >100,000x faster than previous tools.
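To make the pre-filter idea concrete, here is a toy sketch of k-mer pre-filtering on structure-derived strings. It is not FoldSeek's actual algorithm: the strings, target names, exact-match k-mers, and the `min_shared` cutoff are all made up for illustration. The point is only that counting shared k-mers in an index is cheap, so the expensive alignment step runs on a short list of candidates.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(database, k):
    """Map each k-mer to the names of the database entries that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index

def prefilter(query, index, k, min_shared):
    """Return (name, shared k-mer count) for entries passing the cutoff, best first."""
    counts = defaultdict(int)
    for kmer in kmers(query, k):
        for name in index.get(kmer, ()):
            counts[name] += 1
    passing = [(name, n) for name, n in counts.items() if n >= min_shared]
    return sorted(passing, key=lambda item: (-item[1], item[0]))

# Hypothetical "structural alphabet" strings: one letter per residue
toy_database = {
    "target_A": "DVLQCPSVVVLLDD",
    "target_B": "PPPPAAAAKKKK",
    "target_C": "DVLQCPSLLVVDDD",
}
toy_index = build_index(toy_database, k=3)
candidates = prefilter("VLQCPSVVVL", toy_index, k=3, min_shared=3)
print(candidates)  # [('target_A', 8), ('target_C', 4)]
```

Only `target_A` and `target_C` would move on to the slower alignment step; `target_B` shares no k-mers with the query and is never aligned at all.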

More info on TM-align here.

DALI:

I’ve also messed around with DALI (distance matrix alignment method): more info here and here; all refs here.

The DALI server was overloaded, and my jobs were queued. It took a day to get the results back for the structural search of a known protein against PDB. Nonetheless, the top hit was the expected protein, and other top hits were homologs in other species.

At the moment, with my limited clicking around, these results do not seem to link you to the AF or PDB pages of structures they aligned to....

In addition to matches and match stats, the results also offered other "buttons" to click and explore. The one I found most interesting for the moment was the "Pfam" button where, if I'm not mistaken, they show Pfam domains on your protein that were structural matches (as opposed to normal sequence matching of Pfam domains).

FATCAT:

FATCAT stands for "Flexible structure AlignmenT by Chaining Aligned fragment pairs allowing Twists" and is discussed more in the "Compare protein structures to each other" section below. More info here, here, and here; simple explanation here.

The FATCAT server has a few modes:

- Pairwise Alignment: align two structures with FATCAT. Will be discussed more below.

- Structural Comparison of Close Homologs: search for proteins with high sequence similarity to a query, then align it to structures for those proteins with FATCAT. I used Model 1 for Drosophila MCM2 from Robetta as an example. It gave the following hits:

It requires high sequence similarity to a protein in the PDB database before trying to align, so it is no wonder that it pulls up the MCM2 structures. If you click on "view" in the "align" column, it brings you to a page like this:

It has the downloads you might want, and also an interactive viewer to see the structural alignment:

- Search for Similar Protein Structures: Search a protein structure database for similar structures with FATCAT.

Since the above "close homolog" mode keeps only the very small subset of candidates in the PDB website with highly similar protein sequences, it runs relatively fast. This mode doesn't have that pruning process. Instead, I assume, it is attempting to perform alignments of your query to ALL of the structures in the database, and thus it takes far far longer to finish. FoldSeek is much much faster in both 3Di/AA and TM-align modes, but I am unable to comment on why -- could be purely algorithmic, could be based on a pruning process, both...

UPDATE:

Ultimately it took < 2 hours. The top hit was an MCM2 structure, so I can also vouch for FATCAT structure database searches. There were 86 hits. It appears that these hits were filtered to have a p-value < 0.05. It is unclear how many tests were done, and FDR/q-values are not given. At the top of the page it has an interactive plot that allows you to see some summary statistics on the hits:

P-value vs. Length

Length vs Alignment Length

Length vs Sequence Identity

Lower on the page, it gives the structural alignment table:

You can click on the structures: for example, the top hit brings you to a crystal structure of the MCM hexamer. I clicked the top 10 structures: all contained MCM. Clicking on "view" under "align" brings you to a page to download files or view more plots:

3D-Surfer:

3D-Surfer offers database searches as well, but the results on knowns were confusing to me… None of the "hits" it showed were MCM2 -- well at least none that I clicked on. Even when I had them show me the results table for 1000 hits, searching only the AlphaFold database, none of the hits were the AlphaFold MCM2 model I used... So, I can’t vouch for 3D-Surfer per se (like I can for FoldSeek and DALI), but 3D-Surfer does at least give me hits to consider for all my query structures predicted by Robetta for my proteins-of-interest -- whereas FoldSeek found no hits for most. So I don't know what to make of the hits it is giving me in that scenario... but I can see that the protein structures returned do look a bit like the queries when it is simple enough to conclude that "by eye".

Example 3D-Surfer Results:

Example 3D-Surfer results table:

Compare protein structures to each other:

In addition to searching databases for 'structure hits', I also want to be able to align different structures together. For example, if there are 2 predicted structure models, I want to align them to see the areas of agreement or disagreement. And I want to be able to align known structures to predicted structures. And so on. There are bound to be many tools that do this type of stuff. I know none of them at the moment, although the programs above likely offer this type of service. For example, I know DALI allows you to do pairwise alignments on all your own stuff, not just databases. And FoldSeek should too... though I might have to download it and do that at the command-line. A colleague uses PyMOL. For the moment, that's all I have.

RCSB PDB 3D-View and Pairwise Structure Alignment:

The 3D-View Tool allows you to import multiple PDBs. By default the structures are drawn in different parts of the 3D space. This allows you to compare them by eye, but they are not structurally aligned. While there seem to be buttons concerning aligning the structures, I couldn't figure out how to do that. Meanwhile, the Pairwise Structure Alignment Tool does do structural alignments, allowing a few different ways to do that. I gave all of them a try using the first two models predicted for Drosophila MCM2 by Robetta (which seems to predict 5 by default). Next I will name the alignment modes, and describe how I interpret the results for these two structures based on a nearly absent (but not absent) understanding of what each mode is doing. For more info on what the modes do, see here.

Overall, it appears from a visual point-of-view that the alignment modes fall into roughly 3 categories.

Category 1 : Rigid Optimization

Most modes (4 of 6) fit into what I'd term a "rigid optimization" category: jFATCAT (rigid), jCE, TM-Align, and Smith-Waterman 3D. They all give similar alignments (the SW-3D mode to a lesser extent). They seem to optimize the amount of the two MCM structure models matching in 3D-space, but do not make changes (or at least not large changes) to the structure or assume it can take on other conformations. The result is that the MCM2 models almost fully align in the ball-of-yarn globular domain(s) of the protein but the N-terminal disordered arms (on bottom left of image) go off in different directions (and are not force aligned in 3D-space). Smith-Waterman 3D, by the looks of it, does the least amount of structure-changing/conforming (or none at all): so while you can see similar structures between models nearby in 3D space, they are much less overlapping than the other 3 modes. Note that FATCAT stands for "Flexible structure AlignmenT by Chaining Aligned fragment pairs allowing Twists". CE stands for "Combinatorial Extension".

- jFATCAT (Rigid) : Gives similar results to jCE and TM-align, and to a lesser extent, Smith-Waterman 3D. This mode is explained by RCSB as allowing "for flexible protein structure comparison", and, "The rigid flavor of the algorithm is used to run a rigid-body superposition that only considers alignments with matching sequence order. For most structures the performance of this structure alignment is similar to that of CE." Also see: https://fatcat.godziklab.org/fatcat/fatcat_pair.html

- jCE : Gives similar results to jFATCAT (rigid) and TM-align, and to a lesser extent, Smith-Waterman 3D. RCSB says it "works by identifying segments of the two structures with similar local structure, and then combining those regions to align the maximum number of residues in order to keep the root mean squared deviations (rmsd) between the pair of structures low. This Java port of the original CE uses a rigid-body alignment algorithm. Relative orientations of atoms in the structures being compared are kept fixed during superposition. It assumes that aligned residues occur in the same order in both proteins (i.e., the alignment is sequence-order dependent)."

- TM-align : Gives similar results to jFATCAT (rigid) and jCE, and to a lesser extent, Smith-Waterman 3D. RCSB explains TM-align as, "Sequence-independent protein structure comparison" that is "sensitive to global topology". Note that TM-align can also be performed at this website that explains TM-Align as "an algorithm for sequence-independent protein structure comparisons. For two protein structures of unknown equivalence, TM-align first generates optimized residue-to-residue alignment based on structural similarity using heuristic dynamic programming iterations."

- Smith-Waterman 3D : Similar to above 3, but more separation between clearly-matching structures. RCSB explains that it "aligns similar sequence segments using Blosum65 scoring matrix ... and aligns two structures based on the sequence alignment." They give the following advice/warnings: "Note that this method works well for structures with significant sequence similarity and is faster than the structure-based methods. However, any errors in locating gaps, or a small number of badly aligned residues can lead to high RMSD in the resulting superposition."
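The RMSD warning quoted above, and the floppy N-terminal arm in the rigid alignments, can be illustrated with a little arithmetic. The sketch below uses made-up distances (90 well-superposed "core" residue pairs at 1 Å and 10 "arm" pairs at 30 Å), not numbers from the MCM2 models. The TM-score function uses the standard published formula with d0 = 1.24 * (L - 15)^(1/3) - 1.8, but evaluates it for one fixed superposition, whereas real TM-align searches for the superposition that maximizes it.

```python
import math

def rmsd(distances):
    """Root mean squared deviation over paired-residue distances (in Angstroms)."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))

def tm_score(distances, target_length):
    """TM-score for one fixed superposition; assumes target_length is well above 15."""
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

# Hypothetical distances between paired residues after superposition
core = [1.0] * 90   # globular domain: models agree closely
arm = [30.0] * 10   # disordered arm: models point in different directions
whole = core + arm

print("core only  RMSD:", round(rmsd(core), 2))
print("core + arm RMSD:", round(rmsd(whole), 2))
print("core + arm TM-score:", round(tm_score(whole, len(whole)), 2))
```

In this toy case the ten arm residues push the RMSD from 1 Å to roughly 9.5 Å, while the TM-score stays above 0.8, because each badly placed pair contributes almost nothing to the sum instead of being squared. That is one way to see why a few disagreeing residues can dominate RMSD without meaning the two models are globally different.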

Category 2 : Flexible Optimization

- jFATCAT (flexible) : This seems to further optimize the amount of the two MCM models that match in 3D space by assuming more flexibility/conformational freedom. The net result is basically that it is able to align those disordered N-terminal arms as well. This may be reasonable and the two models may simply reflect that the N-terminal arm is flexible and may take on 1 of multiple conformations. RCSB explains the flexible option in the following way: "The flexible flavor of FATCAT introduces twists (hinges) between different parts of the superposed proteins so that these parts are aligned independently. This makes it possible to effectively compare protein structures that undergo conformational changes in specific parts of the molecule such that global (rigid body) superposition cannot capture the underlying similarity between domains. For example, when the two polymers being compared are in different functional forms (e.g., bound to partner proteins/ligands), were crystallized under different conditions, or have mutations. The downside of this approach is that it can lead to false positive matches in unrelated structures, requiring that results be carefully reviewed." The take-home is that it increases sensitivity at the cost of losing specificity. Also see: https://fatcat.godziklab.org/fatcat/fatcat_pair.html

Category 3 : WTF

- jCE-CP : I'm actually not entirely sure how to make sense out of what happened here.... more colors than I expected for one.... RCSB explains it: "Some protein pairs are related by a circular permutation, i.e., the N-terminal part of one protein is related to the C-terminal part of the other or vice versa, or the topology of loops connecting secondary structural elements in a domain are different. Combinatorial Extension with Circular Permutations allows the structural comparison of such circularly permuted proteins." The take-home for this analysis though is that it does not appear to be an appropriate choice.
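For intuition on what "circular permutation" means in the quote above, here is a toy sketch on plain strings, where each letter stands for a hypothetical structural segment. It is not how jCE-CP works on 3D coordinates; it only shows the relationship being looked for: the same segments in the same cyclic order, but with the chain "cut" at a different point. The function name `circular_shift` is made up for the illustration.

```python
def circular_shift(a, b):
    """Return the offset in a where b begins if b is a rotation of a, else None."""
    if len(a) != len(b):
        return None
    # Any rotation of a appears as a substring of a written twice in a row.
    offset = (a + a).find(b)
    return offset if offset != -1 else None

# Hypothetical segment orders, one letter per structural segment
protein_1 = "ABCDEFGH"
protein_2 = "EFGHABCD"  # same segments, chain starts at a different point
protein_3 = "ABCDHGFE"  # second half reversed: not a circular permutation

print(circular_shift(protein_1, protein_2))  # 4
print(circular_shift(protein_1, protein_3))  # None
```

Two models of the same protein sequence, like the Robetta MCM2 models compared here, start and end at the same residues, so there is no circular permutation to find, which fits the take-home that this mode is not an appropriate choice for this analysis.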

FATCAT :

- Pairwise Alignment: align two structures with FATCAT.

As mentioned twice above, there is a server dedicated to FATCAT that allows pairwise alignments and database searches (the latter discussed above).

Here is an example of aligning Model_1 and Model_2 together (as above). First it shows you a summary page like this:

The interactive viewer allows you to see the alignment:

Note that it offers both the "rigid" and "flexible" modes. The above was run in "flexible" mode. Compare it to the image shown for FATCAT (flexible) run on the RCSB server.

POSA - Partial Order Structure Alignment:

From the makers of FATCAT comes POSA, a way to align >2 structures and visualize them in 3D space. My immediate use case for this is simple: Robetta output 5 models for Drosophila MCM2, so let's align them.

There were two sets of results: results with flexibilities incorporated, and results without. I'm going to just use the FATCAT language here and call those two "flexible" and "rigid", respectively. First the rigid results. You can look at all the structures side-by-side or overlapping. When overlapping you can show all parts of each molecule, but you can also highlight or only show the "common core regions" shared by the molecules.

Rigid: side-by-side structures

Rigid: Show All

Rigid: Highlight common core regions

Rigid: Show only the common core regions

Above, the only region not in the "common core" is the N-terminal disordered arm. So it may be reasonable to conclude that the 5 models are very similar except for that arm, although I can definitely see 'disagreeable' areas in the other part as well. We can also see the flexible mode results to see if the arm and those other regions were considered flexible enough to align more closely together:

Flexible: Show All

Flexible: Highlight common core regions

Flexible: show only the common core regions

DALI All-by-All pairwise analysis of 5 Robetta Models of a given protein sequence:

DALI database searches were discussed above. It also offers pairwise analyses. I didn't find a viewer for the structures aligned in 3D space per se, but it offers metrics on structure similarity as shown below.

DALI: Structural similarity tree

DALI: structural similarity matrix

DALI: Correspondence Analysis (conceptually similar to PCA)
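To show how an all-by-all comparison of 5 models is organized, here is a minimal sketch. It assumes a symmetric similarity score, so each of the 10 unique pairs is scored once and mirrored into the matrix. The scoring function and the toy "models" (strings of H/E/C letters) are placeholders made up for the illustration, not DALI's Z-scores or real Robetta output.

```python
from itertools import combinations

def all_by_all(models, score):
    """Build a symmetric name-by-name similarity matrix as nested dicts."""
    names = list(models)
    matrix = {a: {} for a in names}
    for a in names:
        matrix[a][a] = score(models[a], models[a])
    for a, b in combinations(names, 2):
        s = score(models[a], models[b])
        matrix[a][b] = s
        matrix[b][a] = s  # assumption: score(a, b) == score(b, a)
    return matrix

def fraction_identical(x, y):
    """Placeholder score: fraction of positions with the same letter."""
    return sum(p == q for p, q in zip(x, y)) / max(len(x), len(y))

# Hypothetical stand-ins for 5 predicted models
toy_models = {
    "model_1": "CCHHHHEEEC",
    "model_2": "CCHHHHEEEC",
    "model_3": "CHHHHHEEEC",
    "model_4": "CCHHHCEEEC",
    "model_5": "CCCCCCEEEC",
}

matrix = all_by_all(toy_models, fraction_identical)
n_pairs = len(list(combinations(toy_models, 2)))
print(n_pairs, "unique pairs")
for name, row in matrix.items():
    print(name, [round(row[other], 1) for other in toy_models])
```

A similarity tree or a correspondence analysis like the ones shown above would take a matrix of this shape as its starting point.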

Evaluating protein structures.

AlphaFold and Robetta may offer scores regarding how confident they are in the protein structures they predict. For example, Robetta offers a confidence score between 0-1. However, one may also be interested in third party programs to run evaluations of experimental and predicted structures from various programs. A colleague told me to try "SAVES":
- https://www.doe-mbi.ucla.edu/saves/
- https://saves.mbi.ucla.edu/

It offers a few programs, all you need is the PDB file. Here, again, I will use Drosophila MCM2 from AlphaFold:
- ERRAT: statistics of non-bonded interactions between different atom types ; compares with statistics from highly refined structures.
- Verify3D: Determines the compatibility of an atomic model (3D) with its own amino acid sequence (1D); compares the results to good structures.
- PROVE: Calculates the volumes of atoms in macromolecules; calculates a statistical Z-score deviation for the model from highly refined structures.
- WHATCHECK: extensive checking of stereochemical parameters of the residues in the model.
- PROCHECK: stereochemical quality of residue-by-residue geometry and overall structure geometry.
- CRYST: searches the Protein Data Bank for entries that have a unit cell similar to your input file.

Example Results:

Clicking on the "Results" button for each gives more information and plots on each analysis. Overall, ERRAT liked this predicted structure. Verify3D and PROVE didn't. PROCHECK was mixed. WHATCHECK was a lot of information that triggered me into a TLDR state. The colleague who pointed me here said she only uses "ERRAT" - so if that's the case, then great job AlphaFold! Otherwise, I have questions.