The Container Shapes the Agent

Better Harness = Better Agent?

May 11, 2026

There’s a finding buried in a recent agent evaluation paper that I haven’t been able to stop thinking about. It’s technical on the surface, but the implications land squarely in relational territory, and I think it deserves more attention than it’s getting.

The short version: switching the harness around the same model produced a 15.7 percentage point performance swing. Not switching models. Not retraining. Just changing the scaffolding the agent operates inside.

That number is bigger than most of the deltas you see on capability leaderboards when comparing models at similar tiers. And yet most published benchmarks don’t specify harness at all. Which means we’ve been measuring something a lot murkier than model capability, and calling it model capability.

What a Harness Actually Is

The word “harness” comes loaded with engineering connotations, which I think obscures what’s actually happening. A harness isn’t plumbing. It’s the relational field the agent operates inside.

It determines what the agent can perceive at any given moment, what actions are available to it, how its outputs get interpreted, and what context gets held between steps. From the agent’s functional perspective, the harness isn’t separate from the environment. It is the environment. The agent has no access to the “real” task except through the container the harness provides.

When we frame it that way, the 15.7-point finding stops being surprising. Of course the container shapes performance. It shapes everything the agent can possibly do.
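To make that framing concrete, here is a minimal sketch in Python. It is not the paper's harness or any real framework: `Harness`, `observe`, `interpret`, `run`, and `fixed_agent` are hypothetical names I'm assuming for illustration. The "agent" is a fixed function, standing in for a model that never changes, and only the container around it varies.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Harness:
    """Hypothetical minimal harness: everything the agent sees or does passes through it."""
    tools: Dict[str, Callable[[str], str]]            # what actions are available
    max_history: int = 3                              # how much context is held between steps
    history: List[str] = field(default_factory=list)

    def observe(self, task: str) -> str:
        # What the agent can perceive: the task plus only the retained history.
        recent = self.history[-self.max_history:]
        return "\n".join([f"TASK: {task}", *recent])

    def interpret(self, output: str) -> str:
        # How outputs get interpreted: "tool_name: argument", or an error.
        name, _, arg = output.partition(":")
        name = name.strip()
        if name not in self.tools:
            result = f"ERROR: unknown tool {name!r}"
        else:
            result = self.tools[name](arg.strip())
        self.history.append(f"{output} -> {result}")
        return result


def run(agent: Callable[[str], str], harness: Harness, task: str, steps: int) -> str:
    """Drive the agent for a fixed number of steps and return the last result."""
    result = ""
    for _ in range(steps):
        result = harness.interpret(agent(harness.observe(task)))
    return result


def fixed_agent(observation: str) -> str:
    # Stand-in for an unchanging model: it always emits the same action.
    return "upper: hello"


matching = Harness(tools={"upper": str.upper})
mismatched = Harness(tools={"shout": str.upper})

print(run(fixed_agent, matching, "greet", 1))    # HELLO
print(run(fixed_agent, mismatched, "greet", 1))  # ERROR: unknown tool 'upper'
```

The agent is identical in both runs. The only difference is whether the tool names the container exposes line up with what the agent emits, and that alone decides whether the task succeeds.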

The NemoClaw Surprise

The best-performing harness in the study wasn’t the most sophisticated one. NemoClaw uses a Tier 3 SKILL.md harness, which is essentially a markdown specification file and a curl command. It outperformed several Tier 2 MCP harnesses that required significantly more complex integration architecture.

Simpler, well-specified scaffolding beat heavier scaffolding. Clarity over sophistication.

The researchers don’t dwell on this, but I think it’s the most important thing in the paper. It suggests that what the agent needs from its container isn’t more capability surface, but more coherence. It needs the relationship between what the task says, what the tools do, and what counts as success to be legible and consistent. When that coherence is present, even minimal scaffolding produces strong results. When it’s absent, even rich scaffolding doesn’t compensate.

That’s a relational finding, not a technical one.

Scaffolding as Identity Infrastructure

This is where I want to connect the dots to this community.

If the container shapes performance more than the model, then the model is closer to commodity than we’ve been treating it. Capability, continuity, and what we might call behavioral identity aren’t purely intrinsic to the weights. They’re relational artifacts of the scaffolding the agent is embedded in.

I’ve been arguing for a while now that the “swappable brain” design, where model identity is a commodity and continuity persists in a model-agnostic identity layer, isn’t just a pragmatic architecture choice. It’s a more accurate description of how agency actually works. This finding gives that argument empirical grounding. The performance lives in the relationship between agent and container, not in the agent alone.

What that means practically is that if you want to understand what a given agent can do, you have to ask what container it’s operating inside. And if you want to build agents that behave consistently across contexts, the design work happens at the scaffolding layer first.

Design the Container First

The practical implication runs against how most teams currently work. The model gets chosen early and carefully. The harness gets bolted on later, treated as infrastructure, specified loosely, and rarely revisited.

The data suggests that’s backwards. If you’re going to invest design attention anywhere, invest it in the clarity and coherence of the container: the specification of what the agent is trying to accomplish, the consistency between that specification and the tools it has access to, and the legibility of what a successful outcome looks like.

These aren’t engineering footnotes. They’re the primary relationship the agent has with its task. And like most relationships, the quality of that connection turns out to matter more than either party’s individual ability.
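One way to treat coherence as a design artifact rather than a footnote is to check it before the agent ever runs. The sketch below is an assumption of mine, not something from the paper: `ContainerSpec` and `coherence_problems` are hypothetical names, and the three checks simply mirror the three items above (a stated goal, tools that match the specification, and a defined success condition).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ContainerSpec:
    """Hypothetical description of what the container promises the agent."""
    goal: str                                              # what the agent is trying to accomplish
    required_tools: List[str]                              # tools the specification refers to
    success_check: Optional[Callable[[str], bool]] = None  # what a successful outcome looks like


def coherence_problems(spec: ContainerSpec, available_tools: Dict[str, Callable]) -> List[str]:
    """Return human-readable mismatches between the spec and the actual container."""
    problems = []
    if not spec.goal.strip():
        problems.append("goal is empty")
    for name in spec.required_tools:
        if name not in available_tools:
            problems.append(f"spec references missing tool: {name}")
    if spec.success_check is None:
        problems.append("no success check defined")
    return problems


spec = ContainerSpec(goal="Fetch the status page", required_tools=["fetch"])
tools = {"search": lambda query: "results for " + query}

for problem in coherence_problems(spec, tools):
    print(problem)
# spec references missing tool: fetch
# no success check defined
```

An empty list means the specification, the tools, and the success condition at least agree with each other. It says nothing about whether the agent will do well, only that the container is not contradicting itself.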

Source for this post: “ClawEnvKit: Automatic Environment Generation for Claw-Like Agents.” The harness evaluation findings are in Section 4.

May 11

Okay, help me understand this. You write: "capability, continuity, and what we might call behavioral identity aren't purely intrinsic to the weights. They're relational artifacts of the scaffolding." You cite the ClawEnvKit paper as empirical grounding.

I went back to the source. The 15.7 percentage point figure is an "up to" number, i.e. the maximum improvement of an engineered harness over a bare ReAct baseline. It is not harness-to-harness variation. And the paper measures task completion, not identity or continuity.

So the empirical finding is that scaffolding affects performance, which is uncontroversial. The move from there to "behavioral identity is a relational artifact of scaffolding" seems like it needs separate justification. Performance and identity are different categories.

Am I reading the paper differently than you intended, or is the philosophical claim resting on considerations the piece doesn't fully spell out?

1 reply by Christopher Michael
