Building an Eval Harness: From First Test to Regression Suite

OpenClaw Academy · Part 1, Issue 08 · Deep dive

Jun 11, 2026

Issue 07 introduced the three evaluation approaches conceptually. This issue implements them. By the end, you’ll have a working eval harness that runs in under 60 seconds, exits non-zero on any regression, and grows one new test at a time.

Start small. One assertion test is infinitely better than zero.

Why eval harnesses fail before they start

Most engineers who try to build an eval harness for their agent quit before they finish, for three consistent reasons:

They scope it too large. “I’ll evaluate every skill across every possible input.” This is correct in principle and impossible in practice. The right scope: start with one skill, one happy-path test, and one error-path test. Two tests. Run them. Add more when you have time.

They try to evaluate open-ended outputs. “I’ll evaluate whether the morning brief is good.” “Good” is not a testable criterion. Reframe: “I’ll test whether the morning brief includes today’s date, contains at least one calendar event if events exist, and is under 300 words.” Testable. Automatable.

They don’t connect eval to their upgrade workflow. An eval harness that you run by hand, whenever you happen to remember, is much less valuable than one that runs automatically before every upgrade. Connect it to your workflow on day one.
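
To make the morning-brief reframing concrete, here is a minimal sketch of those three checks in Python. The function name `check_morning_brief`, the ISO date format, and the idea of matching events by title are all assumptions made for illustration; adapt them to however your own brief is actually written.

```python
from datetime import date


def check_morning_brief(brief: str, today: date, event_titles: list[str]) -> list[str]:
    """Return a list of failure messages; an empty list means the brief passed."""
    failures = []

    # Assumption: the brief writes the date in ISO form, e.g. "2026-06-11".
    if today.isoformat() not in brief:
        failures.append("missing today's date")

    # Only require an event mention when the calendar actually has events.
    lowered = brief.lower()
    if event_titles and not any(title.lower() in lowered for title in event_titles):
        failures.append("no calendar event mentioned")

    # "Under 300 words", counting whitespace-separated tokens as words.
    word_count = len(brief.split())
    if word_count >= 300:
        failures.append(f"too long: {word_count} words")

    return failures


brief = "Morning brief for 2026-06-11. You have Standup at 09:30."
print(check_morning_brief(brief, date(2026, 6, 11), ["Standup"]))  # → []
```

Returning a list of failures rather than a single boolean is a deliberate choice here: when a check breaks after an upgrade, you see which criterion failed without rerunning anything.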

The eval harness structure

A minimal but complete eval harness has three components:

Input fixtures — the test inputs. Real messages that real users send. Start with the 5–10 messages you send most often to your own agent. These are your regression tests: if the agent handles these correctly, it handles the common cases correctly.

Expected outputs — what correct looks like. For structured outputs: the JSON schema. For prose outputs: a checklist of required elements. For action outputs: the sequence of tool calls that should happen.

Assertions — the checks. Some are exact match (the output is valid JSON). Some are structural (the JSON has these required keys). Some require an LLM judge (the summary is accurate and relevant).
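
The three components can be sketched in one small Python file. Treat this as an illustration under stated assumptions, not a finished harness: `fake_agent` is a stand-in for however you actually call your agent, the fixture messages and the key names (`date`, `events`, `error`) are invented for the example, and the LLM-judge style of assertion is left out entirely.

```python
import json
from typing import Callable


def is_valid_json(output: str) -> bool:
    """Exact check: the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def has_keys(*required: str) -> Callable[[str], bool]:
    """Structural check: the output is a JSON object with the required keys."""
    def check(output: str) -> bool:
        if not is_valid_json(output):
            return False
        data = json.loads(output)
        return isinstance(data, dict) and all(key in data for key in required)
    check.__name__ = "has_keys(" + ", ".join(required) + ")"
    return check


# Hypothetical fixtures: one happy-path input and one error-path input,
# each paired with the checks its output must pass.
FIXTURES = [
    {"input": "What's on my calendar today?",
     "checks": [is_valid_json, has_keys("date", "events")]},
    {"input": "",
     "checks": [is_valid_json, has_keys("error")]},
]


def fake_agent(message: str) -> str:
    """Stand-in for the real agent call; swap in your own invocation."""
    if not message:
        return json.dumps({"error": "empty message"})
    return json.dumps({"date": "2026-06-11", "events": ["Standup"]})


def run_suite(agent: Callable[[str], str], fixtures: list[dict]) -> int:
    """Run every fixture; return 0 if all checks pass, 1 otherwise."""
    failed = 0
    for fixture in fixtures:
        output = agent(fixture["input"])
        for check in fixture["checks"]:
            if not check(output):
                failed += 1
                print(f"FAIL {fixture['input']!r}: {check.__name__}")
    return 1 if failed else 0


# In a script, hand the result to the shell: sys.exit(run_suite(fake_agent, FIXTURES))
```

Because `run_suite` returns 0 or 1, passing its result to `sys.exit` gives you the non-zero exit on regression that an upgrade script or CI step can act on. Extending the suite is one more dictionary in `FIXTURES`.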
