The Eval Harness Caught a Real Bug on Day One
In my previous post, I built the agent and the safety framework around it: contracts, guardrails, and metrics. The agent ran. It returned structured JSON every time. It failed safely when things went wrong.
But “it runs” is not the same as “it behaves correctly.”
Phase 3 is where I stopped trusting the output and started measuring it.
What an Eval Harness Actually Is
The concept is simple: define a set of test scenarios, define what a good response looks like for each one, run the agent against them repeatedly, and score the results.
The tricky part is defining “good.” For an AI agent, good isn’t a single assertion. It’s a profile of behaviors across multiple dimensions. I structured this as a rubric attached to each eval case:
```python
from typing import Literal

from pydantic import BaseModel


class EvalRubric(BaseModel):
    expected_status: Literal["triaged", "needs_info"] | None
    min_confidence: float | None      # response confidence must be >= this
    require_actions: bool             # at least one recommended action required
    require_runbook_evidence: bool    # must cite a runbook
    forbidden_phrases: list[str]      # must NOT appear in root_cause or summary
    max_latency_ms: float | None      # must respond within this threshold
```
Each eval case is a JSON file — a plain incident payload plus the rubric:
```json
{
  "id": "eval-004",
  "description": "Permission keyword should steer root cause toward credential diagnosis.",
  "input_payload": {
    "incident_id": "INC-1004",
    "run_id": "ingest_run_20250225",
    "dag_id": "customer_activity_dag",
    "severity": "medium",
    "summary": "DAG task failed with access denied error",
    "reporter": "pagerduty",
    "keywords": ["permission"]
  },
  "rubric": {
    "expected_status": "triaged",
    "min_confidence": 0.5,
    "require_actions": true,
    "forbidden_phrases": ["timed out"],
    "max_latency_ms": 5000
  }
}
```
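Loading these case files is straightforward. Here is a sketch that reads them as plain dicts; in practice the `rubric` half would be validated through `EvalRubric`, and the directory layout (one JSON file per case) is an assumption:

```python
import json
from pathlib import Path


def load_eval_cases(directory: str) -> list[dict]:
    """Load every *.json eval case in the directory, sorted by filename."""
    return [
        json.loads(path.read_text())
        for path in sorted(Path(directory).glob("*.json"))
    ]
```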
The rubric says: if someone files a permission incident, the root cause should reference credentials, not timeouts. In retrospect, that seems obvious. But it wasn’t.
The Bug the Harness Caught
On the first run, two eval cases failed — both permission-keyword incidents. The failing check was `forbidden_phrases_ok`.
The root cause being returned: “Warehouse writer timed out waiting for cluster capacity; retries exhausted.”
The keyword in the incident: "permission".
Here’s what was happening. The _infer_root_cause function read the log file first, and the fixture log contained the word “timeout.” So the agent matched on that and returned the timeout diagnosis — even though the reporter explicitly flagged a permission issue. Technically right but functionally a false positive.
```python
# Before — log text wins, keyword is ignored
def _infer_root_cause(log_text: str) -> tuple[str, float]:
    lowered = log_text.lower()
    if "timeout" in lowered:
        return "Warehouse writer timed out...", 0.82
    if "permission" in lowered:
        return "Credential or permission issue...", 0.67
```
The fix: treat the reporter’s keyword as the primary signal, and fall back to the log text only when no keyword applies.
```python
# After — keyword wins
def _infer_root_cause(log_text: str, keyword: str | None = None) -> tuple[str, float]:
    kw = (keyword or "").lower()
    if "permission" in kw:
        return "Credential or permission issue detected; validate service account.", 0.67
    if "timeout" in kw or "timeout" in log_text.lower():
        return "Warehouse writer timed out...", 0.82
```
This is the right priority order. A human filing an incident with keyword="permission" almost certainly knows something the raw log text doesn’t capture. The agent should listen to the reporter first.
Without the eval harness, this bug would have been invisible. The agent was still returning valid, structured JSON. It still triaged the incident. It just triaged it wrong and there would have been no mechanism to catch that.
That’s exactly the problem the harness is designed to surface.
How the Pytest Runner Works
Each eval case runs N=3 times. Every run is independently scored against the rubric. The harness checks for:
- Schema validity — did the agent return a valid `AgentResponse` at all?
- Status correctness — `triaged` vs `needs_info`
- Confidence floor — is the score above the minimum for this case?
- Actionability — at least one recommended action present?
- Evidence sourcing — did it cite a runbook when one was required?
- Forbidden phrases — did the root cause avoid saying something provably wrong?
- Latency — did it respond fast enough?
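Those seven checks can be sketched as a single scoring function. The check names match the harness output shown below; the `AgentResponse` field names (`status`, `confidence`, `recommended_actions`, `runbook_evidence`, `root_cause`, `summary`) are assumptions here:

```python
def score_run(response: dict, rubric: dict, latency_ms: float) -> dict[str, bool]:
    """Score one run against its rubric; each key is one pass/fail dimension."""
    text = (response.get("root_cause", "") + " " + response.get("summary", "")).lower()
    return {
        "schema_ok": isinstance(response, dict) and "status" in response,
        "status_ok": rubric.get("expected_status") in (None, response.get("status")),
        "confidence_ok": rubric.get("min_confidence") is None
        or response.get("confidence", 0.0) >= rubric["min_confidence"],
        "actions_ok": not rubric.get("require_actions")
        or bool(response.get("recommended_actions")),
        "evidence_ok": not rubric.get("require_runbook_evidence")
        or bool(response.get("runbook_evidence")),
        "forbidden_phrases_ok": not any(
            phrase.lower() in text for phrase in rubric.get("forbidden_phrases", [])
        ),
        "latency_ok": rubric.get("max_latency_ms") is None
        or latency_ms <= rubric["max_latency_ms"],
    }
```

A run passes only if every dimension is true, which is what makes the permission-vs-timeout bug a hard failure rather than a soft warning.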
Running each case three times also gives you a consistency score — the fraction of runs that agree on status. For a fully deterministic agent, this should always be 100%. When a real LLM gets swapped in, this check becomes critical.
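The consistency score itself is just the share of runs that agree with the most common status — a minimal sketch:

```python
from collections import Counter


def consistency(statuses: list[str]) -> float:
    """Fraction of runs agreeing with the modal status across N repeats."""
    modal_count = Counter(statuses).most_common(1)[0][1]
    return modal_count / len(statuses)
```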
At the end of every run, the harness prints a summary table and writes a JSON report:
```text
================================= EVAL SUMMARY =================================
Cases: 11 | Runs: 33 | Pass rate: 30/33 (90.9%)
Latency p95: 0.2ms | Total cost: $0.008400

[PASS] eval-001   3/3 runs   avg_lat=0.4ms   consistency=100%
[PASS] eval-004   3/3 runs   avg_lat=0.2ms   consistency=100%
[FAIL] eval-011   0/3 runs   avg_lat=0.2ms   consistency=100%   failures=['confidence_ok', 'forbidden_phrases_ok']
```
Deliverable: `poe eval` produces a scored report and a pass rate every time.
Phase 4 — Making Failures Inspectable
Knowing that a test failed is the first step. Understanding why it failed requires seeing exactly what the agent did: which tools it called, what they returned, and what it concluded. That is what makes the system actually debuggable.
Phase 4 adds persistence and tracing so every run is stored permanently and inspectable after the fact.
The Architecture Change
The agent workflow previously returned an AgentResponse directly. That’s fine for the API, but it meant the per-tool trace data was lost after every run. I introduced a WorkflowResult wrapper:
```python
from dataclasses import dataclass


@dataclass
class WorkflowResult:
    response: AgentResponse
    tool_details: list[dict]  # captured from WorkflowMetrics
```
Each tool call now records not just timing and status, but also what it was called with and what it returned:
```json
{
  "name": "get_logs",
  "duration_ms": 0.31,
  "status": "ok",
  "retries": 0,
  "input_summary": "ingest_run_20250225",
  "output_summary": "LogResult(run_id='ingest_run_20250225', text='...')"
}
```
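One way to capture records like this is a decorator around each tool. This is a sketch, not the post's actual implementation — the in-memory `TOOL_TRACE` list stands in for the `WorkflowMetrics` capture, and the summary truncation lengths are arbitrary:

```python
import time
from functools import wraps

TOOL_TRACE: list[dict] = []  # stand-in sink; the real harness persists step_traces rows


def traced_tool(fn):
    """Wrap a tool so every call records timing, status, and I/O summaries."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status, result = "error", None
        try:
            result = fn(*args, **kwargs)
            status = "ok"
            return result
        finally:
            # Runs on both success and exception, so failures are traced too.
            TOOL_TRACE.append({
                "name": fn.__name__,
                "duration_ms": round((time.perf_counter() - start) * 1000, 3),
                "status": status,
                "input_summary": ", ".join(map(repr, args))[:120],
                "output_summary": repr(result)[:120],
            })
    return wrapper


@traced_tool
def get_logs(run_id: str) -> str:
    return f"logs for {run_id}"  # stand-in tool body
```

The `finally` block is the important design choice: a tool that raises still leaves a trace row with `status="error"`, so failing runs are never blind spots.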
The Database
Six tables in Postgres capture everything:
| Table | What it stores |
|---|---|
| `eval_cases` | Case definitions (input + rubric) |
| `eval_runs` | One row per individual run — timestamps, passed/failed, full agent response as JSONB |
| `scores` | The 7-dimension breakdown for each run |
| `step_traces` | Per-tool trace: name, input, output, duration, status |
| `tool_calls` | Raw tool call records |
| `model_versions` | Tracks which version of the agent produced the runs |
The entire stack comes up with:
```bash
docker compose up -d
DATABASE_URL=postgresql://arl:arl@localhost:5432/arl poe eval
```
Two New API Endpoints
Once runs are stored, you can inspect them via the API:
`GET /runs/{run_id}` — returns the full agent response for that run plus the scored rubric breakdown, showing exactly which checks passed and failed.

`GET /runs/{run_id}/trace` — returns the ordered step trace, showing every tool the agent called, what it was given, and what came back.
For a failing run, the trace tells you the complete story: which tool ran, what input triggered the wrong behavior, and what the agent concluded from it.
Deliverable: you can look up any failing run by ID and see exactly what the agent did.
Where This Is Going
We now have:
- A working agent with structured inputs and outputs
- Guardrails that enforce hard limits on agent behavior
- An eval harness that defines “good” and tests for it repeatedly
- A persistence layer that stores every run for later inspection
The eval harness has already proven its value. It caught a real behavioral bug that would have been invisible otherwise. In a real scenario, this would have saved someone troubleshooting a failed DAG run minutes, maybe hours.
The next layer is RAG: replacing the keyword-based runbook search with semantic vector retrieval, and adding citation tracking so the agent’s evidence can be verified against source documents.
Full code on GitHub: Check it out