The Eval Harness Caught a Real Bug on Day One
In my previous post, I built the agent and the safety framework around it: contracts, guardrails, and metrics. The agent ran. It returned structured JSON every time. It failed safely when things went wrong.
But “it runs” is not the same as “it behaves correctly.”
Phase 3 is where I stopped trusting the output and started measuring it.
What an Eval Harness Actually Is
The concept is simple: define a set of test scenarios, define what a good response looks like for each one, run the agent against them repeatedly, and score the results.
The tricky part is defining “good.” For an AI agent, good isn’t a single assertion. It’s a profile of behaviors across multiple dimensions. I structured this as a rubric attached to each eval case:
```python
from typing import Literal

from pydantic import BaseModel


class EvalRubric(BaseModel):
    expected_status: Literal["triaged", "needs_info"] | None
    min_confidence: float | None      # response confidence must be >= this
    require_actions: bool             # at least one recommended action required
    require_runbook_evidence: bool    # must cite a runbook
    forbidden_phrases: list[str]      # must NOT appear in root_cause or summary
    max_latency_ms: float | None      # must respond within this threshold
```
Each eval case is a JSON file — a plain incident payload plus the rubric:
```json
{
  "id": "eval-004",
  "description": "Permission keyword should steer root cause toward credential diagnosis.",
  "input_payload": {
    "incident_id": "INC-1004",
    "run_id": "ingest_run_20250225",
    "dag_id": "customer_activity_dag",
    "severity": "medium",
    "summary": "DAG task failed with access denied error",
    "reporter": "pagerduty",
    "keywords": ["permission"]
  },
  "rubric": {
    "expected_status": "triaged",
    "min_confidence": 0.5,
    "require_actions": true,
    "forbidden_phrases": ["timed out"],
    "max_latency_ms": 5000
  }
}
```
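Loading these case files is straightforward. Here is a sketch that reads them as plain dicts; in practice the `rubric` half would be validated through `EvalRubric`, and the directory layout (one JSON file per case) is an assumption:

```python
import json
from pathlib import Path


def load_eval_cases(directory: str) -> list[dict]:
    """Load every *.json eval case in the directory, sorted by filename."""
    return [
        json.loads(path.read_text())
        for path in sorted(Path(directory).glob("*.json"))
    ]
```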
The rubric says: if someone files a permission incident, the root cause should reference credentials, not timeouts. In retrospect, that seems obvious. But it wasn’t.
The Bug the Harness Caught
On the first run, two eval cases failed — both permission-keyword incidents. The failing check was `forbidden_phrases_ok`.
The root cause being returned: “Warehouse writer timed out waiting for cluster capacity; retries exhausted.”
The keyword in the incident: "permission".
Here’s what was happening. The _infer_root_cause function read the log file first, and the fixture log contained the word “timeout.” So the agent matched on that and returned the timeout diagnosis — even though the reporter explicitly flagged a permission issue. Technically right but functionally a false positive.
```python
# Before — log text wins, keyword is ignored
def _infer_root_cause(log_text: str) -> tuple[str, float]:
    lowered = log_text.lower()
    if "timeout" in lowered:
        return "Warehouse writer timed out...", 0.82
    if "permission" in lowered:
        return "Credential or permission issue...", 0.67
```
The fix: treat the reporter’s keyword as the primary signal, and fall back to the log text only when no keyword applies.
```python
# After — keyword wins
def _infer_root_cause(log_text: str, keyword: str | None = None) -> tuple[str, float]:
    kw = (keyword or "").lower()
    if "permission" in kw:
        return "Credential or permission issue detected; validate service account.", 0.67
    if "timeout" in kw or "timeout" in log_text.lower():
        return "Warehouse writer timed out...", 0.82
```
This is the right priority order. A human filing an incident with keyword="permission" almost certainly knows something the raw log text doesn’t capture. The agent should listen to the reporter first.
Without the eval harness, this bug would have been invisible. The agent was still returning valid, structured JSON. It still triaged the incident. It just triaged it wrong and there would have been no mechanism to catch that.
That’s exactly the problem the harness is designed to surface.
How the Pytest Runner Works
Each eval case runs N=3 times. Every run is independently scored against the rubric. The harness checks for:
- Schema validity — did the agent return a valid `AgentResponse` at all?
- Status correctness — `triaged` vs `needs_info`
- Confidence floor — is the score above the minimum for this case?
- Actionability — at least one recommended action present?
- Evidence sourcing — did it cite a runbook when one was required?
- Forbidden phrases — did the root cause avoid saying something provably wrong?
- Latency — did it respond fast enough?
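Those seven checks can be sketched as a single scoring function. The check names match the harness output shown below; the `AgentResponse` field names (`status`, `confidence`, `recommended_actions`, `runbook_evidence`, `root_cause`, `summary`) are assumptions here:

```python
def score_run(response: dict, rubric: dict, latency_ms: float) -> dict[str, bool]:
    """Score one run against its rubric; each key is one pass/fail dimension."""
    text = (response.get("root_cause", "") + " " + response.get("summary", "")).lower()
    return {
        "schema_ok": isinstance(response, dict) and "status" in response,
        "status_ok": rubric.get("expected_status") in (None, response.get("status")),
        "confidence_ok": rubric.get("min_confidence") is None
        or response.get("confidence", 0.0) >= rubric["min_confidence"],
        "actions_ok": not rubric.get("require_actions")
        or bool(response.get("recommended_actions")),
        "evidence_ok": not rubric.get("require_runbook_evidence")
        or bool(response.get("runbook_evidence")),
        "forbidden_phrases_ok": not any(
            phrase.lower() in text for phrase in rubric.get("forbidden_phrases", [])
        ),
        "latency_ok": rubric.get("max_latency_ms") is None
        or latency_ms <= rubric["max_latency_ms"],
    }
```

A run passes only if every dimension is true, which is what makes the permission-vs-timeout bug a hard failure rather than a soft warning.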
Running each case three times also gives you a consistency score — the fraction of runs that agree on status. For a fully deterministic agent, this should always be 100%. When a real LLM gets swapped in, this check becomes critical.
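The consistency score itself is just the share of runs that agree with the most common status — a minimal sketch:

```python
from collections import Counter


def consistency(statuses: list[str]) -> float:
    """Fraction of runs agreeing with the modal status across N repeats."""
    modal_count = Counter(statuses).most_common(1)[0][1]
    return modal_count / len(statuses)
```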
At the end of every run, the harness prints a summary table and writes a JSON report:
```text
================================= EVAL SUMMARY =================================
Cases: 11 | Runs: 33 | Pass rate: 30/33 (90.9%)
Latency p95: 0.2ms | Total cost: $0.008400

[PASS] eval-001   3/3 runs   avg_lat=0.4ms   consistency=100%
[PASS] eval-004   3/3 runs   avg_lat=0.2ms   consistency=100%
[FAIL] eval-011   0/3 runs   avg_lat=0.2ms   consistency=100%   failures=['confidence_ok', 'forbidden_phrases_ok']
```
Deliverable: `poe eval` produces a scored report and a pass rate every time.
Phase 4 — Making Failures Inspectable
Knowing that a test failed is the first step. Understanding why it failed requires seeing exactly what the agent did: which tools it called, what they returned, and what it concluded. That is what makes the system actually debuggable.
Phase 4 adds persistence and tracing so every run is stored permanently and inspectable after the fact.
The Architecture Change
The agent workflow previously returned an AgentResponse directly. That’s fine for the API, but it meant the per-tool trace data was lost after every run. I introduced a WorkflowResult wrapper:
```python
from dataclasses import dataclass


@dataclass
class WorkflowResult:
    response: AgentResponse
    tool_details: list[dict]  # captured from WorkflowMetrics
```
Each tool call now records not just timing and status, but also what it was called with and what it returned:
```json
{
  "name": "get_logs",
  "duration_ms": 0.31,
  "status": "ok",
  "retries": 0,
  "input_summary": "ingest_run_20250225",
  "output_summary": "LogResult(run_id='ingest_run_20250225', text='...')"
}
```
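One way to capture records like this is a decorator around each tool. This is a sketch, not the post's actual implementation — the in-memory `TOOL_TRACE` list stands in for the `WorkflowMetrics` capture, and the summary truncation lengths are arbitrary:

```python
import time
from functools import wraps

TOOL_TRACE: list[dict] = []  # stand-in sink; the real harness persists step_traces rows


def traced_tool(fn):
    """Wrap a tool so every call records timing, status, and I/O summaries."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status, result = "error", None
        try:
            result = fn(*args, **kwargs)
            status = "ok"
            return result
        finally:
            # Runs on both success and exception, so failures are traced too.
            TOOL_TRACE.append({
                "name": fn.__name__,
                "duration_ms": round((time.perf_counter() - start) * 1000, 3),
                "status": status,
                "input_summary": ", ".join(map(repr, args))[:120],
                "output_summary": repr(result)[:120],
            })
    return wrapper


@traced_tool
def get_logs(run_id: str) -> str:
    return f"logs for {run_id}"  # stand-in tool body
```

The `finally` block is the important design choice: a tool that raises still leaves a trace row with `status="error"`, so failing runs are never blind spots.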
The Database
Six tables in Postgres capture everything:
| Table | What it stores |
|---|---|
| `eval_cases` | Case definitions (input + rubric) |
| `eval_runs` | One row per individual run — timestamps, passed/failed, full agent response as JSONB |
| `scores` | The 7-dimension breakdown for each run |
| `step_traces` | Per-tool trace: name, input, output, duration, status |
| `tool_calls` | Raw tool call records |
| `model_versions` | Tracks which version of the agent produced the runs |
The entire stack comes up with:
```bash
docker compose up -d
DATABASE_URL=postgresql://arl:arl@localhost:5432/arl poe eval
```
Two New API Endpoints
Once runs are stored, you can inspect them via the API:
`GET /runs/{run_id}` — returns the full agent response for that run plus the scored rubric breakdown, showing exactly which checks passed and failed.

`GET /runs/{run_id}/trace` — returns the ordered step trace, showing every tool the agent called, what it was given, and what came back.
For a failing run, the trace tells you the complete story: which tool ran, what input triggered the wrong behavior, and what the agent concluded from it.
Deliverable: you can look up any failing run by ID and see exactly what the agent did.
Where This Is Going
We now have:
- A working agent with structured inputs and outputs
- Guardrails that enforce hard limits on agent behavior
- An eval harness that defines “good” and tests for it repeatedly
- A persistence layer that stores every run for later inspection
The eval harness has already proven its value. It caught a real behavioral bug that would have been invisible otherwise. In a real scenario, this would have saved someone troubleshooting a failed DAG run minutes, maybe hours.
The next layer is RAG: replacing the keyword-based runbook search with semantic vector retrieval, and adding citation tracking so the agent’s evidence can be verified against source documents.
Full code on GitHub: Check it out