Gold eval — regression suite

The 30-flow ground-truth set that gates every extractor change before it can be merged.

Convoship's import pipeline turns Draw.io, PDF, Word, and image sources into a runnable agent. Because the pipeline includes an LLM pass, drift is a real risk: a model upgrade, a prompt tweak, or a parser change can silently degrade quality on real customer flows. The gold-eval harness exists to catch that drift before it ships.

What it is

  • Thirty real-world conversational flows (loan applications, return policies, triage trees, IT runbooks, hospitality intake, etc.) hand-curated as ground truth.
  • Each gold flow has a source file (Draw.io, PDF, image) and its canonical extracted JSON.
  • The packages/eval harness runs the deterministic extractor over the suite and compares the output against ground truth.
  • Reported metrics: node F1, edge F1, intent F1, slot F1, and per-source-type latency.
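
As a rough illustration of what an F1 metric over nodes means, the sketch below scores one flow by treating the extracted and ground-truth nodes as sets of IDs. This is an assumption made for illustration: the matching rule packages/eval actually uses is not described here, and the function name and node IDs are hypothetical.

```python
from typing import Hashable, Iterable


def set_f1(predicted: Iterable[Hashable], expected: Iterable[Hashable]) -> float:
    """F1 between two collections, each treated as a set of hashable items."""
    pred, gold = set(predicted), set(expected)
    if not pred and not gold:
        return 1.0  # Nothing expected and nothing extracted: count as a match.
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)  # Share of extracted items that are correct.
    recall = true_positives / len(gold)     # Share of expected items that were found.
    return 2 * precision * recall / (precision + recall)


# Hypothetical node IDs for a single flow; the extractor missed "reject".
gold_nodes = {"start", "ask_amount", "check_credit", "approve", "reject"}
extracted_nodes = {"start", "ask_amount", "check_credit", "approve"}

node_f1 = set_f1(extracted_nodes, gold_nodes)
print(f"node F1 = {node_f1:.3f}")
```

Under the same assumption, the same function would apply to edges (as source/target pairs), intents, and slots.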

How it gates releases

The Extraction Eval job in .github/workflows/ci.yml runs the suite on every PR and every push to main. Node F1 must stay at or above 0.99: a PR that drops it below that threshold cannot merge. This is not a best-effort metric; it is a checked-in CI gate.

# Run the eval suite locally
uv pip install --system -e packages/schema/python
uv pip install --system -e apps/studio-api
uv pip install --system -e packages/eval
convoship-eval
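
Conceptually, the gate is a threshold check whose result becomes the job's exit status. The sketch below shows that idea only; the function name is hypothetical, and how convoship-eval aggregates per-flow scores into a suite-level node F1 is not specified here.

```python
NODE_F1_THRESHOLD = 0.99  # From the text: node F1 below 0.99 blocks the merge.


def gate_exit_code(node_f1: float, threshold: float = NODE_F1_THRESHOLD) -> int:
    """Return a process exit code: 0 if the gate passes, 1 if it fails."""
    return 0 if node_f1 >= threshold else 1


# Hypothetical suite-level scores.
print(gate_exit_code(0.995))  # passes the gate
print(gate_exit_code(0.97))   # fails the gate
```

A CI step fails when its command exits non-zero, which is what turns a score below the threshold into a blocked merge.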

Exporting enriched cases

packages/eval can export enriched gold cases with full extraction telemetry (model used, tokens, repair iterations, validator findings) so the suite can be re-run against a frozen baseline. Use this when validating a model upgrade.
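
To make the telemetry concrete, here is one possible shape for an exported case, written as one JSON object per line. Everything in it is an assumption for illustration: the class, the field names, the file layout, and the sample values are hypothetical and may not match what packages/eval actually emits.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EnrichedCase:
    """Hypothetical record for one exported case; field names are illustrative."""
    case_id: str
    model: str
    tokens: int
    repair_iterations: int
    validator_findings: list[str] = field(default_factory=list)


def to_jsonl(cases: list[EnrichedCase]) -> str:
    """Serialize cases as one JSON object per line, with stable key order."""
    return "\n".join(json.dumps(asdict(case), sort_keys=True) for case in cases)


# Placeholder values only; these are not real measurements.
baseline = to_jsonl([
    EnrichedCase("loan-application-01", "example-model-v1", 1840, 0),
    EnrichedCase("it-runbook-07", "example-model-v1", 2215, 1, ["orphan node"]),
])
print(baseline)
```

Sorting the keys keeps the output stable between runs, so a frozen baseline produces small, readable diffs when a model upgrade changes the telemetry.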

Adding a new case

Drop the source file in packages/eval/gold/, add the canonical JSON alongside it, and run convoship-eval --update-baseline locally. Commit both files in the same PR with a short note on what the case is testing.
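
Before committing, it can help to confirm that every source file has its canonical JSON next to it. The sketch below assumes the JSON shares the source file's stem (for example returns.drawio and returns.json) and assumes a set of source extensions; neither convention is confirmed here, and the helper name is hypothetical.

```python
from pathlib import PurePath

# Assumed source extensions; adjust to whatever the gold directory really holds.
SOURCE_SUFFIXES = {".drawio", ".pdf", ".png", ".jpg"}


def missing_canonical_json(filenames: list[str]) -> list[str]:
    """Return source files that have no same-stem .json file in the listing."""
    paths = [PurePath(name) for name in filenames]
    json_stems = {p.stem for p in paths if p.suffix.lower() == ".json"}
    return sorted(
        p.name
        for p in paths
        if p.suffix.lower() in SOURCE_SUFFIXES and p.stem not in json_stems
    )


# Hypothetical directory listing: triage.pdf has no triage.json beside it.
listing = ["returns.drawio", "returns.json", "triage.pdf"]
unpaired = missing_canonical_json(listing)
print(unpaired)
```

To run it against the real directory, pass os.listdir("packages/eval/gold") as the listing; an empty result means every source file is paired.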

What it does and does not cover

Gold eval is the regression gate for the import pipeline. It does NOT replace per-agent evals on live agents. Each agent in production can carry its own intent eval set (Workspace → Eval); that is where production drift is measured against real conversation logs.