Playground & evals

Test agentic turns with SSE streaming, traces, regression evals, and version eval history.

Use Playground to validate draft config before publish. Use Evals for repeatable regression checks; optionally gate publish with a workspace pass-rate threshold.

Playground

  • Route: /app/agents/{slug}/playground.
  • POST /v1/ai-agents/{slug}/playground — JSON response by default.
  • POST .../playground?stream=true — Server-Sent Events with event: token, trace, done.
  • Studio UI streams tokens into the bot bubble when streaming is enabled.
  • Each turn persists an AiAgentConversation record and its trace; the full trace is linked under Conversations.
  • The playground channel is always allowed, even if Web is the only channel listed for the agent.
  • The BUDGET guardrail blocks LLM calls once daily_spend_cap_cents is exceeded.
  • The RPM limit uses the same workspace rpm_cap_per_agent as the public embed (0 = unlimited).
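
The streaming response can be read with any SSE client. The following is a minimal sketch of parsing such a stream in Python. Only the event names token, trace, and done come from the list above; the helper name parse_sse and the sample payloads are assumptions for illustration, since the payload format is not specified here.

```python
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # SSE strips one leading space
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


# Hypothetical stream; real payloads may differ.
sample = [
    "event: token", "data: Hel", "",
    "event: token", "data: lo", "",
    "event: trace", 'data: {"kind": "think"}', "",
    "event: done", "data: {}", "",
]

events = list(parse_sse(sample))
# Concatenate token payloads to rebuild the reply text.
reply = "".join(d for e, d in events if e == "token")
```

In a real client the lines would come from the HTTP response body of POST .../playground?stream=true, read incrementally rather than from a list.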

Trace panel

Event kind   Meaning
user         Inbound message
think        Model reasoning step
tool         Dispatch result: name, args, output, duration_ms
ask          Clarifying question to user
guardrail    AUTH / LIMIT / BUDGET / REDACT / ESCALATE hit
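
As an illustration of working with these events, the sketch below totals tool time per tool name. It assumes each trace entry is a dict with a "kind" key holding one of the event kinds above; that key name, the "text" and "code" fields, and the sample data are hypothetical. Only name, args, output, and duration_ms on tool events come from the table.

```python
from collections import defaultdict
from typing import Dict, List


def tool_time_by_name(trace: List[dict]) -> Dict[str, int]:
    """Sum duration_ms per tool name across all tool events in a trace."""
    totals: Dict[str, int] = defaultdict(int)
    for entry in trace:
        if entry.get("kind") == "tool":
            totals[entry["name"]] += entry.get("duration_ms", 0)
    return dict(totals)


# Hypothetical trace for illustration only.
trace = [
    {"kind": "user", "text": "Where is my order?"},
    {"kind": "think"},
    {"kind": "tool", "name": "lookup_order", "args": {"id": "A1"},
     "output": "shipped", "duration_ms": 120},
    {"kind": "tool", "name": "lookup_order", "args": {"id": "A2"},
     "output": "pending", "duration_ms": 80},
    {"kind": "guardrail", "code": "REDACT"},
]

totals = tool_time_by_name(trace)
```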

Evals tab

  1. Create eval — name, user prompt, expected outcome text.
  2. Run — POST .../evals/{id}/run executes playground + judge_eval (LLM JSON verdict or substring fallback).
  3. Counters — pass_count, fail_count, last_status on the eval row.
  4. Publish trend — bar chart from version eval_pass_rate (click a bar for per-version runs).

LLM judge

judge_eval returns { passed, rationale }. It uses the workspace's BYOK keys when configured and platform keys otherwise. If no keys are available, it falls back to a substring match on the expected outcome text.
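
The verdict-or-fallback logic might look roughly like the sketch below. This is not the platform's judge_eval; the function name judge_verdict is hypothetical, and two behaviors are assumptions: matching is case-insensitive, and a malformed LLM verdict also falls through to the substring check.

```python
import json
from typing import Optional


def judge_verdict(llm_output: Optional[str], reply: str, expected: str) -> dict:
    """Return {"passed", "rationale"} from an LLM JSON verdict, else substring match."""
    if llm_output is not None:
        try:
            verdict = json.loads(llm_output)
            if isinstance(verdict, dict) and isinstance(verdict.get("passed"), bool):
                return {
                    "passed": verdict["passed"],
                    "rationale": str(verdict.get("rationale", "")),
                }
        except json.JSONDecodeError:
            pass  # assumption: malformed verdicts fall through to the fallback
    # Fallback: no LLM verdict available (for example, no keys configured).
    passed = expected.lower() in reply.lower()
    return {"passed": passed, "rationale": "substring match on expected text"}


with_llm = judge_verdict('{"passed": true, "rationale": "ok"}', "anything", "n/a")
without_keys = judge_verdict(None, "Your order has Shipped.", "shipped")
```

Substring matching is coarse, so expected outcome text that is short and distinctive tends to give more reliable results when no judge model is available.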

Conversations

/app/agents/{slug}/convos lists playground and embed traffic. The detail view shows a chronological trace with tool timing, using the same schema as public production conversations.

After evals pass in Playground, set eval_pass_threshold under Workspace Settings and publish from Deploy. Eval runs are snapshotted on the version for audit.
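
A publish gate of this kind reduces to a pass-rate comparison. The sketch below assumes eval_pass_threshold is a fraction between 0 and 1, that the rate is passing evals divided by total evals, that a version with no eval runs has a rate of 0, and that the comparison is inclusive. None of these details are confirmed by this page, and the function names are hypothetical.

```python
def pass_rate(passed: int, failed: int) -> float:
    """Fraction of evals that passed; 0.0 when nothing has run (assumption)."""
    total = passed + failed
    return passed / total if total else 0.0


def can_publish(passed: int, failed: int, eval_pass_threshold: float) -> bool:
    """True when the pass rate meets or exceeds the threshold."""
    return pass_rate(passed, failed) >= eval_pass_threshold


ok = can_publish(9, 1, 0.9)       # 9 of 10 evals passed, threshold 0.9
blocked = can_publish(8, 2, 0.9)  # 8 of 10 evals passed, threshold 0.9
```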