Written by Meji Abidoye.
We run thousands of experiments a day at Abundant and have tried a number of different approaches. This document captures where we have settled and why we think continuing to pursue this path is a good idea. It will not go into great detail about the alternatives we have tried, although we may discuss those elsewhere at some point.
We have these overarching goals:
These goals have led us to adopt the pull request and continuous integration as the base units of experimentation at Abundant. The pull request is a natural expression of an experiment that people (technical and non-technical) are already familiar with. To wax poetic for a moment: a pull request is an expression of a desired change to reality, as denoted by what is in the main branch, and, continuing the analogy, continuous integration is a way of ensuring that the desired change does not break current reality within the parameters we care about, as defined by our specified tests. A pull request is a hypothesis and CI is a test of that hypothesis. Pull requests also provide a simple and natural way to collaborate with others. For log storage we just dump everything to S3. Nice and easy.
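As a concrete sketch of what this looks like in practice, a CI workflow can run the experiment on every commit to a pull request branch and sync the logs to S3. The workflow below is a hypothetical illustration, not our actual configuration; the workflow name, script path, and bucket name are all made up.

```yaml
# Hypothetical illustration: run the experiment manifest on each PR commit
# and upload the resulting logs to S3. Names and paths are invented.
name: run-experiment
on:
  pull_request:  # each commit to the PR branch triggers a new run

jobs:
  experiment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run experiment from manifest
        run: ./run-experiment --manifest experiment.yaml --out ./logs
      - name: Upload logs to S3
        run: aws s3 sync ./logs "s3://example-experiment-logs/${{ github.sha }}/"
```

Keying the upload by commit SHA is what makes each run addressable later: one commit, one run, one prefix in the bucket.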
Our original contribution is the following.
Here’s an example experiment manifest.
version: "1.0"
metadata:
  name: "Security Vulnerability Detection"
  description: "Tests agent performance on identifying and fixing security vulnerabilities"
k: 5 # not implemented yet
timeout: 600 # seconds, not implemented yet
tasks:
  - command-injection-vulnerability
  - cicd-secrets-leak-scanner
  - api-change-guard
agents:
  - nop
  - oracle
  - claude-code@claude-sonnet-4-5 # agent versioning not implemented yet
  - gemini-cli
  - terminus@openai/gpt-5.1-2025-11-13
compute:
  - morph # not implemented yet
  - lambda # not implemented yet
  - actions
  - modal # not implemented yet

Tasks are scenarios defined by the specifications of well-known benchmarks such as Terminal-Bench, SWE-bench, etc.
An agent is a combination of a harness and a model provider. There are two default agents we include in any manifest: nop and oracle. Nop means no-operation; it produces a baseline failure that represents the null hypothesis. Oracle produces a known working solution. We specify the other agents as harness[@modelid].
The experiment viewer is where we explore logs after runs are complete. It is just a convenience wrapper over an S3 bucket, and team members are free to download the underlying logs for any local analysis they would like to perform.
Here’s a screenshot of an experiment with multiple runs. Each run corresponds to a single commit to the pull request branch; more on this later in the document. This run was done on a single compute platform.
The aggregate results from a run can be viewed in isolation
Or they can be compared across time
This comparison, for example, shows the impact of changing the task image from node-18-slim to node-22-slim.
Clicking on the badge for a single agent will take you to a detail page
Which contains the results from the run, the agent trajectory and a frozen snapshot of the scenario definition at the time the experiment was run.
There are custom viewers for each kind of agent trajectory for ease of analysis.
The underlying plaintext is also available for download
And links correspond 1:1 to S3 objects, so they are easy to share and easy for team members to query directly.
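Because every viewer link maps 1:1 to an S3 object, the location of any artifact can be computed deterministically from its coordinates. The bucket name and key layout below are invented for illustration; the real scheme may differ.

```python
# Hypothetical sketch: deterministic mapping from run coordinates to an S3 key.
# Bucket name and key layout are made up; the real scheme may differ.

def log_object_key(experiment: str, commit: str, agent: str, task: str) -> str:
    """Build the S3 key for one agent/task log within a run (one run per commit)."""
    return f"{experiment}/{commit}/{agent}/{task}/trajectory.log"

def share_url(bucket: str, key: str) -> str:
    """S3 URI for the object; pasting this is enough for a teammate to fetch it."""
    return f"s3://{bucket}/{key}"
```

The point of a scheme like this is that nothing needs a database lookup: an experiment name, a commit SHA, an agent, and a task pin down exactly one object.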
If you are creating or using rl environments, you need to know which environments are worth using and which ones are not. The only way to know is to run agents inside the environments and observe how the agents perform.
If you are hand-crafting environments one at a time, it is acceptable to do these runs one by one and read the raw logs (in fact, I recommend you start here to appreciate the scope of the problem). If you plan to create or use environments at any kind of scale (e.g. see the work on our task creator, SWE-gen), you quickly run into many tooling deficiencies. Here's a non-exhaustive list: