Written by Meji Abidoye.
We run thousands of experiments a day at Abundant and have tried a number of different approaches. This document captures where we have settled and why we think continuing to pursue this path is a good idea. It will not go into great detail about the alternatives we have tried, although we may discuss those elsewhere at some point.
We have these overarching goals:
These goals have led us to adopt the pull request and continuous integration as the base units of experimentation at Abundant. The pull request is a natural expression of an experiment that people (technical and non-technical) are already familiar with. To wax poetic for a moment: a pull request is an expression of a desired change to reality, as denoted by what is in the main branch, and, continuing the analogy, continuous integration is a way of ensuring that the desired change does not break current reality within the parameters we care about, as defined by our specified tests. A pull request is a hypothesis and CI is a test of that hypothesis. Pull requests also provide a simple and natural way to collaborate with others. For log storage we just dump everything to S3. Nice and easy.
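As a concrete sketch of what this looks like in practice, a CI workflow can run the experiment on every commit to a pull request branch and sync the logs to S3. The workflow below is a hypothetical illustration, not our actual configuration; the workflow name, script path, and bucket name are all made up.

```yaml
# Hypothetical illustration: run the experiment manifest on each PR commit
# and upload the resulting logs to S3. Names and paths are invented.
name: run-experiment
on:
  pull_request:  # each commit to the PR branch triggers a new run

jobs:
  experiment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run experiment from manifest
        run: ./run-experiment --manifest experiment.yaml --out ./logs
      - name: Upload logs to S3
        run: aws s3 sync ./logs "s3://example-experiment-logs/${{ github.sha }}/"
```

Keying the upload by commit SHA is what makes each run addressable later: one commit, one run, one prefix in the bucket.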
Our original contribution is the following.
Here’s an example experiment manifest.
version: "1.0"
metadata:
  name: "Security Vulnerability Detection"
  description: "Tests agent performance on identifying and fixing security vulnerabilities"
k: 5 # not implemented yet
timeout: 600 # seconds, not implemented yet
tasks:
  - command-injection-vulnerability
  - cicd-secrets-leak-scanner
  - api-change-guard
agents:
  - nop
  - oracle
  - claude-code@claude-sonnet-4-5 # agent versioning not implemented yet
  - gemini-cli
  - terminus@openai/gpt-5.1-2025-11-13
compute:
  - morph # not implemented yet
  - lambda # not implemented yet
  - actions
  - modal # not implemented yet

Tasks are scenarios defined by the specifications of well-known benchmarks such as Terminal-Bench, SWE-bench, etc.
An agent is a combination of a harness and a model provider. There are two default agents we include in any manifest: nop and oracle. Nop means no-operation; it produces a baseline failure that represents the null hypothesis. Oracle produces a known working solution. We specify the other agents as harness[@modelid].
The experiment viewer is where we explore logs after runs are complete. It is just a convenience wrapper over an S3 bucket, and team members are free to download the underlying logs for any local analysis they would like to perform.
Here’s a screenshot of an experiment with multiple runs. Each run corresponds to a single commit to the pull request branch; more on this later in the document. This run was done on a single compute platform.
The aggregate results from a run can be viewed in isolation
Or they can be compared across time
This comparison, for example, shows the impact of changing the task image from node-18-slim to node-22-slim.
Clicking on the badge for a single agent will take you to a detail page
Which contains the results from the run, the agent trajectory and a frozen snapshot of the scenario definition at the time the experiment was run.
There are custom viewers for each kind of agent trajectory for ease of analysis.
The underlying plaintext is also available for download
And links correspond 1:1 to S3 objects, so they are easy to share and easy for team members to query directly.
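Because every viewer link maps 1:1 to an S3 object, the location of any artifact can be computed deterministically from its coordinates. The bucket name and key layout below are invented for illustration; the real scheme may differ.

```python
# Hypothetical sketch: deterministic mapping from run coordinates to an S3 key.
# Bucket name and key layout are made up; the real scheme may differ.

def log_object_key(experiment: str, commit: str, agent: str, task: str) -> str:
    """Build the S3 key for one agent/task log within a run (one run per commit)."""
    return f"{experiment}/{commit}/{agent}/{task}/trajectory.log"

def share_url(bucket: str, key: str) -> str:
    """S3 URI for the object; pasting this is enough for a teammate to fetch it."""
    return f"s3://{bucket}/{key}"
```

The point of a scheme like this is that nothing needs a database lookup: an experiment name, a commit SHA, an agent, and a task pin down exactly one object.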
If you are creating or using rl environments, you need to know which environments are worth using and which ones are not. The only way to know is to run agents inside the environments and observe how the agents perform.
If you are hand-crafting environments one at a time, it is acceptable to do these runs one by one and read the raw logs (in fact, I recommend you start here to appreciate the scope of the problem). If you plan to create or use environments at any kind of scale (e.g. see the work on our task creator, SWE-gen), you quickly run into many tooling deficiencies. Here's a non-exhaustive list: