Run health

Know which runs to trust.

Observability shows you traces. Run health gives you a verdict. Every agent run gets a 0 to 100 health score and a letter grade, read straight from the signals already in your receipts. No eval set, no labels, no LLM.

The score and the grade

A run starts at 100. Each signal that fires subtracts a fixed penalty. The remaining score maps to a letter grade:

A   90 to 100
B   80 to 89
C   70 to 79
D   60 to 69
F   below 60
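The grade bands above amount to a threshold lookup. A minimal sketch, assuming the score is already within 0 to 100; the function name `grade_for` is hypothetical and not part of the SDK:

```python
def grade_for(score: float) -> str:
    """Map a 0-100 health score to the letter grade bands listed above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"
```

For example, a run that loses 12 points lands at 88 and grades B.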

The signals

Every point off is traceable to one named signal, so a poor grade tells you what went wrong, not just that something did. All eight are computed from the run's receipts. The dashboard shows each fired signal with its penalty and detail.

Signal               Off          Detail
coordination_loop    -45          Agents repeated themselves in a cycle without making progress.
gateway_blocked      -25          The gateway blocked a call this run attempted.
severe_finding       -20 to -30   A high (20) or critical (30) severity detector finding fired.
no_progress          -20          An agent cycle revisited the same states without progressing (the early-warning band below the full loop detector).
step_blowup          -8 to -15    Steps ran over the declared max_steps (15), or over the default ceiling when none was declared (8).
cost_blowup          -8 to -15    Cost ran over the declared budget (15), or over the default ceiling when none was declared (8).
repeated_tool_call   -12          The same tool was called repeatedly with the same input or output.
medium_finding       -10          A medium-severity detector finding fired.
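The arithmetic behind the table is small enough to sketch. The penalty values below are copied from the table; everything else is an assumption for illustration: the function name `health_score`, the `(name, penalty)` input shape, and the floor at 0 (the text only says scores run from 0 to 100, while the penalties can sum past 100).

```python
# Penalties each signal may carry, taken from the table above.
ALLOWED_PENALTIES = {
    "coordination_loop": {45},
    "gateway_blocked": {25},
    "severe_finding": {20, 30},   # high = 20, critical = 30
    "no_progress": {20},
    "step_blowup": {8, 15},       # default ceiling = 8, declared max_steps = 15
    "cost_blowup": {8, 15},       # default ceiling = 8, declared budget = 15
    "repeated_tool_call": {12},
    "medium_finding": {10},
}


def health_score(fired):
    """Start at 100 and subtract the penalty of every fired signal.

    `fired` is a list of (signal_name, penalty) pairs (hypothetical shape).
    """
    score = 100
    for name, penalty in fired:
        if penalty not in ALLOWED_PENALTIES.get(name, set()):
            raise ValueError(f"unexpected penalty {penalty} for signal {name!r}")
        score -= penalty
    # Assumption: the score is floored at 0 rather than going negative.
    return max(0, score)
```

A run where `repeated_tool_call` and `medium_finding` both fire ends at 100 - 12 - 10 = 78, a C.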

The routing: pass, flag, or needs-you

The fired signals drive a three-way routing, so you triage by exception:

  • auto-pass when no signal fired. The run is clean and clears without anyone looking.
  • flagged when the run is clearly broken: it looped, a call was blocked, or a high or critical finding fired.
  • needs you for the ambiguous middle: something fired but nothing decisive, so a human decides.
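The three routes above can be sketched as a function over the names of the signals that fired. This is an illustration, not the SDK's implementation: the function name `route` is hypothetical, and treating exactly `coordination_loop`, `gateway_blocked`, and `severe_finding` as the decisive signals is an assumption read off the "flagged" description.

```python
# Assumption: these are the signals the "flagged" route treats as decisive
# (it looped, a call was blocked, or a high or critical finding fired).
DECISIVE_SIGNALS = {"coordination_loop", "gateway_blocked", "severe_finding"}


def route(fired_signals):
    """Return 'auto-pass', 'flagged', or 'needs you' for one run."""
    fired = set(fired_signals)
    if not fired:
        return "auto-pass"      # clean run, nobody needs to look
    if fired & DECISIVE_SIGNALS:
        return "flagged"        # clearly broken
    return "needs you"          # something fired, nothing decisive
```

Under these assumptions a run with only `cost_blowup` goes to a human, while a run with `gateway_blocked` is flagged regardless of what else fired.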

Deterministic by design

The scorer is pure: no network, no clock, no model call. The same input always produces the same score, and the Python and Node ports match the server. It answers whether the run was healthy (looped, blew budget, stalled, tripped a detector), not whether the final answer was semantically correct. That line is deliberate: a semantic judge would need an LLM, and this layer stays deterministic so it can run offline and identically everywhere.

Three ways to get it

The same score, no account required to start.

1. Offline CLI

terminal (bash)
# events the SDK emits, or raw vendor logs mapped offline first
prova-local --file run.ndjson
prova-local --file langsmith-export.ndjson --source langsmith

2. In the SDK

python
from prova_cp import run_health

health = run_health(events)   # { score, grade, signals, needs_human, summary }

node (ts)
import { runHealth } from '@cobound/prova-sdk';

const health = runHealth(events);   // same shape, byte-for-byte with the server

3. On the dashboard

Once you ingest, /dashboard/health grades every run over your window, sorts the ones that need you first, and shows the signals behind each grade. Free on every plan.

Gate deploys on regression

Tag each run with the deploy that produced it (the SDK reads PROVA_RELEASE automatically), then ask whether the new release regressed against the last good one. The check is deterministic and label-free; it compares health, flag rate, loop rate, and cost between the two releases with a confidence interval behind every verdict, and stays quiet below a minimum run count so it never fails a build on noise.

terminal (bash)
# fails (exit 1) only when the candidate release actually regressed
PROVA_API_KEY=prv_... prova-eval \
  --app-id claims-agent --baseline "$LAST_GOOD_SHA" --candidate "$GIT_SHA"

github actions (yaml)
# .github/workflows/agent-eval.yml
- name: Gate on agent regression
  env:
    PROVA_API_KEY: ${{ secrets.PROVA_API_KEY }}
  # block scalar so the shell sees real line continuations
  run: |
    npx --yes @cobound/prova-sdk prova-eval \
      --app-id claims-agent \
      --baseline "${{ vars.LAST_GOOD_RELEASE }}" \
      --candidate "${{ github.sha }}"

The endpoint returns HTTP 422 on a regression and the CLI exits 1, so the step fails the deploy. The same call is available as GET /api/v1/eval/compare for other CI systems.
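The text above does not say which statistics the regression check uses, so the following is only an illustration of the general idea: compare one rate (say, flag rate) between two releases, stay quiet below a minimum run count, and call a regression only when the confidence interval excludes zero. The function name, the normal-approximation interval, the `min_runs` default, and the verdict strings are all assumptions.

```python
import math


def compare_rate(base_bad, base_n, cand_bad, cand_n, min_runs=30, z=1.96):
    """Compare a 'bad run' rate between a baseline and a candidate release.

    Returns a dict with a verdict and an approximate 95% confidence interval
    for (candidate rate - baseline rate). Hypothetical sketch only.
    """
    # Stay quiet on small samples instead of failing a build on noise.
    if base_n < min_runs or cand_n < min_runs:
        return {"verdict": "insufficient_data", "diff": None, "ci": None}

    p_base = base_bad / base_n
    p_cand = cand_bad / cand_n
    diff = p_cand - p_base

    # Normal-approximation standard error for a difference of proportions.
    se = math.sqrt(p_base * (1 - p_base) / base_n + p_cand * (1 - p_cand) / cand_n)
    low, high = diff - z * se, diff + z * se

    # Regressed only if the whole interval sits above zero.
    verdict = "regressed" if low > 0 else "ok"
    return {"verdict": verdict, "diff": diff, "ci": (low, high)}
```

With 100 runs per release, a jump from 5 flagged runs to 30 is called a regression under this sketch, while a move from 5 to 7 is not, because its interval still includes zero.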

Answer quality: the opt-in judge

Run health and the regression gate are deterministic and answer whether the run was healthy. To also ask whether the answer was good, turn on the LLM judge. It is off by default; a stock deployment makes no judge call and the gate stays reproducible.

  • Auditable by design. Every judgement is written as a signed quality_eval receipt: the model, a per-dimension breakdown (correctness, grounding, completeness, instruction following), quoted evidence, and a confidence. The evaluator is on the same tamper-evident record as everything else, so an auditor can check what the judge did and that it was not altered.
  • Not a vibe number. The overall score is the worst dimension, the judge cites evidence, and it abstains when it cannot tell. Abstained runs are excluded from the metric rather than guessed.
  • Same gate. The judge writes the scores; the regression engine reads them as a quality metric, so "did the deploy make the answers worse?" runs through the same comparison and CI gate as everything else, and the gate itself stays deterministic.

terminal (bash)
# enable on the server, then score a release's runs (writes signed quality_eval receipts)
export PROVA_QUALITY_JUDGE=1 ANTHROPIC_API_KEY=sk-...
curl -X POST -H "Authorization: Bearer prv_..." \
  "https://prova.cobound.dev/api/v1/eval/judge?app_id=claims-agent&release=$GIT_SHA"
# then 'quality' appears as a metric in /api/v1/eval/compare and the dashboard
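The "worst dimension" rule and the abstention handling described in the bullets above can be sketched in a few lines. The four dimension names come from the text; the 0 to 1 score scale, the dict layout, the `abstained` flag, and averaging the per-run scores into one metric are assumptions made for illustration.

```python
DIMENSIONS = ("correctness", "grounding", "completeness", "instruction_following")


def overall_score(dimensions):
    """Overall judgement score: the worst (lowest) of the four dimensions."""
    return min(dimensions[name] for name in DIMENSIONS)


def quality_metric(judgements):
    """Mean overall score across runs, excluding runs where the judge abstained.

    Returns None when every run abstained, so nothing is guessed.
    """
    scored = [
        overall_score(j["dimensions"])
        for j in judgements
        if not j.get("abstained", False)
    ]
    if not scored:
        return None
    return sum(scored) / len(scored)
```

Taking the minimum means one weak dimension cannot be averaged away by three strong ones, which is the point of the rule.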

Outcomes: the ground truth

Health is a heuristic and the judge is a proxy. The strongest signal is what a human or downstream system actually did with the output. Record it and acceptance becomes a first-class, auditable regression metric: a signed outcome receipt references the run, so "a reviewer accepted this" or "a reviewer corrected this, and here is the correction" is on the tamper-evident record. That is the kind of evidence the EU AI Act's human-oversight requirements call for.

node (ts)
// thumbs-down from a user, or a reviewer override, on a specific run
await prova.feedback(runId, 'corrected', {
  source: 'reviewer',
  correctedOutput: 'The deductible is $1,500, not $1,000.',
});

http, no SDK (bash)
curl -X POST -H "Authorization: Bearer prv_..." -H "content-type: application/json" \
  https://prova.cobound.dev/api/v1/feedback \
  -d '{"run_id":"run-42","verdict":"accepted"}'

The regression view then reads acceptance_rate and override_rate between releases, so "did the deploy lower acceptance?" runs through the same gate. Because the judge score and the human outcome land on the same runs, the judge can be calibrated against ground truth over time.
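The exact definitions of acceptance_rate and override_rate are not given above. One plausible reading, sketched here as an assumption: both are fractions of the runs that have a recorded outcome, with the `accepted` verdict counting toward acceptance and the `corrected` verdict counting as an override. The function name is hypothetical.

```python
from collections import Counter


def outcome_rates(verdicts):
    """Compute acceptance and override rates from a list of verdict strings.

    Assumption: the denominator is the number of runs with a recorded outcome.
    Returns None when no outcomes were recorded.
    """
    total = len(verdicts)
    if total == 0:
        return None
    counts = Counter(verdicts)
    return {
        "acceptance_rate": counts["accepted"] / total,
        "override_rate": counts["corrected"] / total,
    }
```

Computing the same two numbers per release gives the pair of values the regression view would compare.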

Matched-input A/B (probe sets, pairwise)

Comparing production runs is directional but noisy, because each release sees different inputs. For a clean A/B, run a fixed set of probe inputs against each release and tag them with a stable probe_id (alongside PROVA_RELEASE). Prova pairs the runs by probe_id and asks the judge, per pair, which output is better. Pairwise preference is generally more reliable than absolute scoring, and the verdict is an exact sign test over the wins and losses, so a small curated set can still give a trustworthy answer.
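The pairing step described above is essentially a join on probe_id. A minimal sketch, assuming each run is a dict with `probe_id` and `release` keys (a hypothetical shape, not the documented receipt format) and that a repeated probe keeps its latest run:

```python
def pair_by_probe(runs, baseline, candidate):
    """Pair baseline and candidate runs that share a probe_id.

    Returns a list of (probe_id, baseline_run, candidate_run) tuples,
    sorted by probe_id. Probes seen on only one release are left out.
    """
    by_release = {baseline: {}, candidate: {}}
    for run in runs:
        release = run.get("release")
        probe_id = run.get("probe_id")
        if release in by_release and probe_id is not None:
            # Assumption: if a probe ran twice on a release, the last run wins.
            by_release[release][probe_id] = run

    shared = sorted(by_release[baseline].keys() & by_release[candidate].keys())
    return [(p, by_release[baseline][p], by_release[candidate][p]) for p in shared]
```

Only probes present on both releases form a pair, which is what makes the comparison matched-input.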

terminal (bash)
# after running your probe set on both releases (tagged with probe_id + release)
curl -H "Authorization: Bearer prv_..." \
  "https://prova.cobound.dev/api/v1/eval/pairwise?app_id=claims-agent&baseline=v36&candidate=v37"
# -> { matchedProbes, wins, losses, ties, pValue, verdict }  (422 on a regression)

Ties are dropped from the test, failed judgements shrink the sample rather than skew it, and below a minimum number of decisive pairs the verdict is insufficient_data rather than a guess.
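An exact sign test over wins and losses can be written with nothing beyond binomial coefficients. The sketch below drops ties and refuses to decide below a minimum number of decisive pairs, as described above. The two-sided form, the `min_decisive` and `alpha` defaults, and every verdict string other than `insufficient_data` are assumptions, not the endpoint's documented behavior.

```python
from math import comb


def sign_test(wins, losses, ties=0, min_decisive=6, alpha=0.05):
    """Exact two-sided sign test on candidate wins vs. losses.

    Ties are accepted for completeness but excluded from the test.
    """
    n = wins + losses  # decisive pairs only; ties are dropped
    if n < min_decisive:
        return {"pValue": None, "verdict": "insufficient_data"}

    # Probability of a split at least this lopsided under a fair coin,
    # computed exactly from the binomial distribution and doubled.
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)

    if p_value >= alpha:
        verdict = "no_difference"
    elif losses > wins:
        verdict = "regression"
    else:
        verdict = "improvement"
    return {"pValue": p_value, "verdict": verdict}
```

With 9 wins and 1 loss the tail is (1 + 10) / 1024, so the two-sided p-value is about 0.021; ten probes are enough to separate the releases at the 0.05 level in this sketch.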
