With every major benchmark, AI models close more of the gap with human engineers. Now, the remaining gap is not a knowledge problem. It is a judgment problem. The models lack the kind of taste and intuition that humans develop through structured experience, and no amount of scaling the current recipe will produce it. Closing this gap requires a discipline that looks nothing like software development. Instead, it’s built around ambiguous goals, non-deterministic outcomes, and iteration cycles measured in weeks. That discipline is hillclimbing, and it is the core bottleneck on the path to AGI.
The practice of making a number go up
In hillclimbing, the engineering directive is not "add this feature" or "fix this bug." It is "get better at software engineering." Nobody can hand you a spec for that. Nobody can design it in Figma. You define the capability as best you can in evals, you climb toward it with data, and you iterate. Hillclimbing is the practice of making a number go up for a capability that resists clean definition.
The problem is that the capability always resists clean definition. When your benchmark is SWE-bench, which tests whether a model can resolve real GitHub issues, you're measuring something real. But the capability you actually care about, "be good at software engineering," is much broader and less numeric. It includes debugging intuition, architectural taste, knowing when not to write code. The eval captures a slice of the whole capability pie. SWE-bench Pro raised the bar with harder, multi-file tasks. But each new benchmark is still a proxy for a capability.
This is related to Goodhart's Law: "when a measure becomes a target, it ceases to be a good measure." In hill climbing, we have no choice but to turn measures into targets. We need a number to climb toward. But the number is always a proxy, never the thing itself.
The LMArena controversy showed what happens when the proxy drifts. Model developers optimized for Arena scores that stopped reflecting real model quality. They ended up getting better at the test without getting better at the skill. Nobody caught it until users did.
This is not a failure mode you can engineer away. It is a permanent feature of the work. Every eval you build will eventually degrade. Every number you climb toward will eventually stop meaning what you thought it meant. The discipline is noticing when that happens and rebuilding the eval before the model learns the wrong lessons. The difference between a good eval and a bad one is whether the model develops the real capability or just learns to satisfy the verifier through shortcuts. That is why eval design might be the most important role in AI today.
Everything above describes the challenge of a single act of hillclimbing. At scale, it gets much, much harder, because the model is one shared artifact and nothing is isolated.
We've had decades of experience dividing work between teams in an application. One team works on the frontend. One team works on the backend. You have the two-pizza teams from Jeff Bezos. You have microservices. You have the ability to split products by strict boundaries of APIs, of products that don't need to talk to each other, and teams that don't need to talk to each other.
In capabilities development, a thousand people work on a single model, and one person can undo the change of another through spurious correlations. You improve math reasoning, and somehow instruction following gets worse. On top of that, the iteration cycles are damn long and non-deterministic. You have every expectation that the model will get better by adding more data, more parameters, a different architecture, but it might not happen and you won’t know for weeks. And because each cycle takes weeks, by the time you discover that your math improvement broke instruction following, three other teams have shipped changes on top of it.
The mental model I find most useful for what this actually looks like is an expanding circle. Think of the model's current capabilities as a boundary. The goal is to push that boundary outward in every direction simultaneously. Every researcher, every data contributor, every eval designer is pushing on a different edge, trying to grow it without denting it somewhere else. AGI is when the circle fills the space of everything humans can do.
But because the model's capabilities are entangled, progress in one area can silently compress the boundary in another. The circle bulges unevenly. The model becomes lopsided in ways that users notice before benchmarks do. And sometimes the evals themselves degrade, so the team thinks the circle is growing when really their evals have stopped measuring what they think they're measuring. All three of these failure modes happen constantly, and they happen at the same time.
The discipline of hill climbing at scale is keeping the circle round while making it grow. Not a single path up a single hill, but a sphere inflating from the inside, with the whole system of evals and environments and feedback loops keeping a thousand people pushing outward together without the sphere collapsing on itself.
This is the work, and it's not work that frontier labs can do alone. Researchers at labs are focused on pushing the boundary. They need the evals, the environments, and the feedback infrastructure to already exist so they can climb. Building the infrastructure that lets a thousand researchers push outward at once, requires its own dedicated effort, its own team, and its own obsession with measurement quality.
That's what Abundant does. We build the evals, the environments, and the systems that make hill climbing possible at scale. The bottleneck to AGI is not a shortage of ideas or compute. It's the gap between what we can measure and what we actually want. We exist to close that gap.