Evaluating DSPy-Based Optimisation on AgentBench

AgentBench’s dbbench-std task evaluates an agent’s ability to answer SQL questions in a multi-hop tool-use setting. The controller exposes interaction endpoints so that every task instance can be completed with a small, repeatable tool repertoire:

| Tool | Purpose | Typical arguments |
| --- | --- | --- |
| init | Open a task session, receive the SQL instructions & the concrete question | index=<int> |
| db_query | Issue one SQL query and observe the result | session_id=<str>, sql='<…>' |
| finish | Submit the final JSON-array answer | session_id=<str>, final_answer_json_array_string='<…>' |

A successful run therefore requires a multi-hop dialogue: init → db_query* (≥1) → finish. At each hop the agent must (see the transcript sketch after this list):

  1. Decide which tool to call next.
  2. Build its arguments (e.g., reuse session_id, compose a single-line SQL string).
  3. Incorporate the controller’s new message into its chain-of-thought (CoT) before the following hop.
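
Concretely, a successful episode boils down to a short transcript of tool calls. The sketch below is illustrative only: the index, session_id, table and column names are made up, not taken from a real run.

# Illustrative transcript of one successful dbbench-std episode.
# Each entry records the tool the agent selected and the arguments it built.
episode = [
    {"tool": "init",     "args": {"index": 3}},
    {"tool": "db_query", "args": {"session_id": "101",
                                  "sql": "SHOW COLUMNS FROM `SomeTable`"}},
    {"tool": "db_query", "args": {"session_id": "101",
                                  "sql": "SELECT ColA FROM `SomeTable` WHERE ColB = 'x'"}},
    {"tool": "finish",   "args": {"session_id": "101",
                                  "final_answer_json_array_string": '["x"]'}},
]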

Environment

First, I set up the environment:

conda create -n dspy-agentbench python=3.9
conda activate dspy-agentbench

pip install dspy mlflow func_timeout ujson requests
git clone https://github.com/THUDM/AgentBench
cd AgentBench && pip install -r requirements.txt

followed by pulling AgentBench Docker images:

docker pull mysql
docker pull ubuntu
docker build -f data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles --tag local-os/default
docker build -f data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles --tag local-os/packages
docker build -f data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles --tag local-os/ubuntu

and starting the server for worker tasks:

python -m src.start_task -a

Baseline Chain-of-Thought agent in DSPy

The baseline agent is a CoT module implemented in DSPy. Below is a lightly annotated view of its control flow.

class Agent(dspy.Module):
    def __init__(self, max_steps: int = 5):
        super().__init__()
        # 1.  Build the CoT predictor ----
        sig = dspy.Signature(
            "question, trajectory, functions -> next_selected_fn, args: dict[str, Any]",
            instructions=REACTION_PROTOCOL,          # ❶ English tool grammar
        )
        self.react = ChainOfThought(                # ❷ CoT predictor over the signature
            signature=sig,
            temperature=0.7,
            max_tokens=512
        )
        self.max_steps = max_steps

    # 2.  One AgentBench task instance -------------
    def forward(self, question, functions):
        traj = []                                   # running conversation transcript
        for _ in range(self.max_steps):
            # ⬇⬇⬇ -------------  LM call (DSPy handles I/O) ------------- ⬇⬇⬇
            pred = self.react(
                question   = question,
                trajectory = traj,
                functions  = {n: fn_metadata(f) for n, f in functions.items()},
            )
            # ⬆⬆⬆ -------------------------------------------------------- ⬆⬆⬆

            fn_name = pred.next_selected_fn.strip()
            args    = pred.args or {}

            result  = call_with_timeout(functions[fn_name])(**args)
            traj.append({**pred, **result})         # keep both reasoning & server reply
            if fn_name == "finish":
                break

        return dspy.Prediction(answer=result, trajectory=traj)

| Aspect | What happens |
| --- | --- |
| Signature | The CoT signature has three inputs: question, trajectory (the full history), and functions; and two outputs: next_selected_fn (a string literal that must match one of ["init", "db_query", "finish"]) and an args dict. |
| ChainOfThought | Prepends an extra field called reasoning to the signature. During generation the LM fills reasoning first (“Let’s think step by step …”), then fills next_selected_fn, and finally the JSON-like args. |
| Trajectory growth | After each tool call we append a dict containing the LM reasoning, the selected function, the args actually used, and the server’s return payload/errors. This trajectory is re-fed into the next LM call, giving it visibility over past successes and failures. |
| Termination | The loop exits either when the LM chooses finish itself or when max_steps is hit (in which case a forced finish with a dummy answer is issued so AgentBench can close the session gracefully). |
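
For concreteness, one trajectory entry (the dict appended after a db_query hop) might look roughly like the sketch below; the key holding the controller reply and all of the values are illustrative rather than copied from a real run.

# Rough shape of one trajectory entry: LM prediction fields plus the controller reply.
trajectory_entry = {
    "reasoning":        "The schema is unknown, so inspect the columns first.",   # from the LM
    "next_selected_fn": "db_query",                                               # from the LM
    "args":             {"session_id": "101", "sql": "SHOW COLUMNS FROM `SomeTable`"},
    "output":           "[('ColA',), ('ColB',)]",                                 # controller reply (key name assumed)
}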

I use gemini/gemini-2.0-flash as my language model with temperature = 0.7 and max_tokens = 2048. The baseline CoT agent with this LM achieved a ~ 68 % success rate in finding the correct answers. Next, I wanted to check whether DSPy’s built-in optimiser (SIMBA) could provide a measurable improvement without altering model weights or adding training data.
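
For reference, this is roughly how the endpoint can be wired up in DSPy (the exact call site is not shown in the snippets above; "gemini/gemini-2.0-flash" is passed as a LiteLLM-style model string):

import dspy

# Configure the global LM used by every dspy.Module in this post.
lm = dspy.LM("gemini/gemini-2.0-flash", temperature=0.7, max_tokens=2048)
dspy.configure(lm=lm)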

Optimisation with SIMBA

The SIMBA optimiser is part of DSPy’s teleprompting suite. Its goal is simple: given a metric and a training set, iteratively rewrite the prompt programs that wrap your predictors so that the average metric score improves. The algorithm has three ideas worth highlighting:

| Idea | What actually happens in code |
| --- | --- |
| A pool of competing programs | The optimiser starts with a copy of our baseline agent (the student). Each time it invents a new variant, it registers it and stores its per-example scores. |
| Mini-batch, multi-candidate sampling | For every optimisation step SIMBA draws a mini-batch. For each example it pairs one LM clone (with its own temperature) with one prompt program sampled from the pool via a soft-max over current average scores, then runs that candidate on the example to obtain the metric value. |
| Heuristic edits | New prompt variants are created by stochastic strategies: appending a fresh “demo shot” built from a high-scoring trajectory, or appending a short natural-language rule. If max_demos > 0 both strategies are active; otherwise only rules are used. |
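
The soft-max sampling in the second row is worth spelling out: programs with higher running average scores are drawn more often, but weaker ones keep a non-zero chance of being explored. A minimal sketch (the parameter names are mine, not SIMBA’s internals):

import math
import random

def sample_program(avg_scores: list, temperature: float = 0.2) -> int:
    """Sample one program index via a soft-max over average metric scores."""
    logits = [s / temperature for s in avg_scores]
    m = max(logits)                                    # subtract max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    return random.choices(range(len(avg_scores)), weights=weights, k=1)[0]

# e.g. a pool of four programs with these running average scores:
idx = sample_program([0.55, 0.68, 0.74, 0.62])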

Below is a schematic of one optimisation round (with example parameters):

(bsize = 32, num_candidates = 6)

                 ┌─────────────────────────────┐
                 │    Program pool (size ~k)   │
                 └──────────────┬──────────────┘
                                │ soft-max sampling
                                ▼
             32 train examples  ×  6 LM clones (T = 0.2 each)
                                │
                                ▼
               192 (program, LM, example) triples
                                │ batched execution
                                ▼
           192 metric scores, grouped into buckets
                                │ top-bucket stats
                                ▼
              apply strategies (demos, rules)
                                │ new prompt programs
                                ▼
               evaluate on the same 32 examples
                                │
                                ▼
              register candidates, update pool

In the implementation, the SIMBA portion looks like this:

# The metric simply checks whether the controller reported done=True for the submitted answer.
def metric(_, pred, __):
    return int(bool(pred.answer and pred.answer.get("done")))

simba = dspy.SIMBA(metric=metric, max_steps=5, max_demos=1,
                   bsize=1, num_threads=1)

optim_agent = simba.compile(
    student=Agent(max_steps=5),
    trainset=make_trainset(train_indices),
    seed=42,
)
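
One way to score the compiled program against the baseline is DSPy’s built-in Evaluate harness; the make_testset helper and test_indices below are hypothetical, mirroring make_trainset from above.

# Score the baseline and optimised agents on held-out task indices.
evaluate = dspy.Evaluate(
    devset=make_testset(test_indices),   # hypothetical helper, analogous to make_trainset
    metric=metric,
    num_threads=1,
    display_progress=True,
)
baseline_score  = evaluate(Agent(max_steps=5))
optimised_score = evaluate(optim_agent)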

When the optimised agent was evaluated on the test set, it achieved a ~ 74 % success rate in finding the correct answers. SIMBA searches over wording, shot selection, and signature details; no gradient updates are involved. Further gains would likely come from a larger demo pool, increased tool-calling depth, or more aggressive rule generation.
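
As a concrete (untested) example of a more aggressive search, one could enlarge the demo pool and widen the candidate set while also allowing deeper tool calling; these particular values are suggestions, not settings from this experiment.

# Untested sketch: more demos, more optimisation steps, more candidates per step,
# and a deeper tool-calling budget for the agent itself.
simba_big = dspy.SIMBA(metric=metric, max_steps=8, max_demos=4,
                       bsize=8, num_candidates=8, num_threads=4)
optim_agent_big = simba_big.compile(
    student=Agent(max_steps=8),
    trainset=make_trainset(train_indices),
)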

Discussion

Below are three condensed but real traces taken from the SIMBA-optimised model (max 5 steps). For each task you see the LM’s selected function, the SQL it produced, and the controller’s reply that was ultimately fed back into the next step.

| Step | Tool call (arguments) | Controller reply (truncated) |
| --- | --- | --- |
| Task #7 — Crest Whitestrips | | |
| 1 | init(index = 7) | returns session_id = 292, task question about lasting whiteness |
| 2 | db_query(sql = SHOW COLUMNS FROM Crest Whitestrips Products) | column list |
| 3 | db_query(sql = SELECT Last of whiteness … WHERE Model IN (…)) | [('12 months',), ('12 months',)] |
| 4 | finish(final_answer = ["12 months","12 months"]) | done = True |
| Task #11 — MMA Fight Record | | |
| 1 | init(index = 11) | session_id = 293, question about Masato Shiozawa |
| 2 | db_query(sql = SELECT Event FROM MMA Fight Record WHERE Opponent = 'masato shiozawa') | [('Shooto 2003 – 5/4 in Korakuen Hall',)] |
| 3 | finish(final_answer = ["Shooto 2003 – 5/4 in Korakuen Hall"]) | done = True |
| Task #36 — NFL Draft Picks | | |
| 1 | init(index = 36) | session_id = 294, question about Round for Indiana < 198 |
| 2 | db_query(sql = SELECT * FROM NFL Draft Picks) | full table (truncated) |
| 3 | db_query(sql = SELECT Round … WHERE School/Club Team = 'Indiana' AND Pick < 198) | [] |
| 4 | finish(final_answer = []) | done = False |

The agent almost always follows the same schema-first → filtered-query → finish pattern and typically completes a task in 3–4 tool calls. In the baseline, ~ 80 % of successful cases finished within four steps; SIMBA kept that length unchanged while reducing validation errors on the first db_query.
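
The step-count figures can be reproduced by measuring trajectory lengths directly; a helper along these lines is enough, assuming predictions are stored as returned by Agent.forward above.

# Tool calls per successful task, given a list of dspy.Prediction objects
# shaped like the ones Agent.forward returns.
def successful_step_counts(predictions):
    return [len(p.trajectory) for p in predictions
            if p.answer and p.answer.get("done")]

# Share of successful runs that finished within four tool calls:
# counts = successful_step_counts(all_predictions)
# within_four = sum(c <= 4 for c in counts) / len(counts)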

A single SIMBA pass (five mini-batch steps with six prompt variants each) nudged the baseline agent from ~ 68 % to ~ 74 % accuracy on dbbench-std. The gain stems almost entirely from fewer formatting and protocol mistakes; no additional reasoning depth or longer trajectories were needed. While modest, this improvement was achieved with minimal engineering effort and a fixed language-model endpoint.