Run health
Know which runs to trust.
Observability shows you traces. Run health gives you a verdict. Every agent run gets a 0 to 100 health score and a letter grade, read straight from the signals already in your receipts. No eval set, no labels, no LLM.
The score and the grade
A run starts at 100. Each signal that fires subtracts a fixed penalty. The remaining score maps to a letter grade:
The signals
Every point off is traceable to one named signal, so a poor grade tells you what went wrong, not just that something did. All eight are computed from the run's receipts. The dashboard shows each fired signal with its penalty and detail.
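A minimal sketch of the subtractive scoring described above. The signal names, penalty values, grade cutoffs, and the floor at 0 are all hypothetical placeholders chosen for illustration; this page does not state the real ones.

```python
from dataclasses import dataclass

# Hypothetical signal names and penalties, for illustration only.
PENALTIES = {
    "loop_detected": 40,
    "call_blocked": 30,
    "budget_exceeded": 20,
    "stalled": 15,
}

# Hypothetical grade cutoffs as (minimum score, letter), highest first.
GRADE_CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


@dataclass(frozen=True)
class Health:
    score: int
    grade: str
    signals: tuple  # (name, penalty) for each fired signal


def grade_for(score: int) -> str:
    """Map a score to the first cutoff it meets, else F."""
    for minimum, letter in GRADE_CUTOFFS:
        if score >= minimum:
            return letter
    return "F"


def score_run(fired: list[str]) -> Health:
    """Start at 100, subtract one fixed penalty per fired signal, floor at 0."""
    signals = tuple((name, PENALTIES[name]) for name in sorted(set(fired)))
    score = max(0, 100 - sum(penalty for _, penalty in signals))
    return Health(score=score, grade=grade_for(score), signals=signals)
```

Because each fired signal is kept alongside its penalty, the result can explain every point deducted, which is the property the paragraph above describes.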
The routing: pass, flag, or needs-you
The score drives a three-way routing so you triage by exception:
- auto-pass when no signal fired. The run is clean and clears without anyone looking.
- flagged when the run is clearly broken: it looped, a call was blocked, or a high or critical finding fired.
- needs you for the ambiguous middle: something fired but nothing decisive, so a human decides.
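The three routes above can be sketched as a small pure function. The Signal shape, the signal names, and the severity labels are assumptions made for this example, not the actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str      # hypothetical signal identifier
    severity: str  # hypothetical: "low", "medium", "high", or "critical"


# Hypothetical names for the decisive conditions listed above.
DECISIVE_NAMES = {"loop_detected", "call_blocked"}
DECISIVE_SEVERITIES = {"high", "critical"}


def route(signals: list[Signal]) -> str:
    """Return "auto-pass", "flagged", or "needs-you" for one run."""
    if not signals:
        return "auto-pass"  # nothing fired: the run is clean
    for signal in signals:
        if signal.name in DECISIVE_NAMES or signal.severity in DECISIVE_SEVERITIES:
            return "flagged"  # clearly broken
    return "needs-you"  # something fired, but nothing decisive
```

For example, `route([Signal("stalled", "low")])` returns `"needs-you"`, while any run containing a critical finding is flagged regardless of what else fired.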
Deterministic by design
The scorer is pure: no network, no clock, no model call. The same input always produces the same score, and the Python and Node ports match the server. It answers whether the run was healthy (looped, blew budget, stalled, tripped a detector), not whether the final answer was semantically correct. That line is deliberate: a semantic judge would need an LLM, and this layer stays deterministic so it can run offline and identically everywhere.
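One way to check that two runs of a pure scorer (or two ports of it) agree is to compare a hash of a canonical encoding of the result. This is an illustration only, not how Prova verifies parity; `fingerprint` and `same_result` are hypothetical helpers.

```python
import hashlib
import json


def fingerprint(health: dict) -> str:
    """SHA-256 of a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(health, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def same_result(a: dict, b: dict) -> bool:
    """True when two health results are identical after canonical encoding."""
    return fingerprint(a) == fingerprint(b)
```

Sorting keys removes dictionary ordering as a source of spurious differences, so any mismatch points at a real divergence in the score, grade, or signals.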
Three ways to get it
The same score, no account required to start.
1. Offline CLI
# events the SDK emits, or raw vendor logs mapped offline first
prova-local --file run.ndjson
prova-local --file langsmith-export.ndjson --source langsmith
2. In the SDK
from prova_cp import run_health
health = run_health(events)  # { score, grade, signals, needs_human, summary }
import { runHealth } from '@cobound/prova-sdk';
const health = runHealth(events); // same shape, byte-for-byte with the server
3. On the dashboard
Once you ingest, /dashboard/health grades every run over your window, sorts the ones that need you first, and shows the signals behind each grade. Free on every plan.
Gate deploys on regression
Tag each run with the deploy that produced it (the SDK reads PROVA_RELEASE automatically), then ask whether the new release regressed against the last good one. The check is deterministic and label-free; it compares health, flag rate, loop rate, and cost between the two releases with a confidence interval behind every verdict, and stays quiet below a minimum run count so it never fails a build on noise.
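A rough sketch of what such a check could look like for one metric, the flag rate. It assumes a normal-approximation (Wald) 95% interval on the difference of two proportions and a hypothetical minimum of 30 runs per release; the page does not say which interval or threshold the real check uses, and the verdict strings here are placeholders.

```python
import math

MIN_RUNS = 30  # hypothetical minimum run count per release
Z_95 = 1.96    # normal quantile for a two-sided 95% interval


def flag_rate_regressed(base_flagged, base_total, cand_flagged, cand_total):
    """Compare flag rates; return (verdict, interval) for candidate minus baseline."""
    if min(base_total, cand_total) < MIN_RUNS:
        # Too few runs: stay quiet rather than fail a build on noise.
        return "insufficient_data", None
    p_base = base_flagged / base_total
    p_cand = cand_flagged / cand_total
    std_err = math.sqrt(
        p_base * (1 - p_base) / base_total + p_cand * (1 - p_cand) / cand_total
    )
    diff = p_cand - p_base
    interval = (diff - Z_95 * std_err, diff + Z_95 * std_err)
    # Call it a regression only when the whole interval sits above zero.
    verdict = "regressed" if interval[0] > 0 else "ok"
    return verdict, interval
```

Requiring the lower bound to clear zero means a small uptick in flags within the noise does not fail the gate.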
# fails (exit 1) only when the candidate release actually regressed
PROVA_API_KEY=prv_... prova-eval \
  --app-id claims-agent --baseline "$LAST_GOOD_SHA" --candidate "$GIT_SHA"
# .github/workflows/agent-eval.yml
- name: Gate on agent regression
  env:
    PROVA_API_KEY: ${{ secrets.PROVA_API_KEY }}
  # A block scalar (|) keeps the backslash line continuations intact for the shell.
  run: |
    npx --yes @cobound/prova-sdk prova-eval \
      --app-id claims-agent \
      --baseline ${{ vars.LAST_GOOD_RELEASE }} \
      --candidate ${{ github.sha }}
The endpoint returns HTTP 422 on a regression and the CLI exits 1, so the step fails the deploy. The same call is available as GET /api/v1/eval/compare for other CI systems.
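For a CI system without the CLI, the call could be wrapped along these lines. The query parameter names (app_id, baseline, candidate) are an assumption that mirrors the CLI flags, and the exit code 2 for unexpected statuses is a choice made for this sketch; check the API reference before relying on either.

```python
import urllib.error
import urllib.parse
import urllib.request

BASE_URL = "https://prova.cobound.dev"  # host used by the curl examples on this page


def build_compare_url(app_id: str, baseline: str, candidate: str) -> str:
    # Parameter names are assumed, mirroring the prova-eval CLI flags.
    query = urllib.parse.urlencode(
        {"app_id": app_id, "baseline": baseline, "candidate": candidate}
    )
    return f"{BASE_URL}/api/v1/eval/compare?{query}"


def exit_code_for_status(status: int) -> int:
    """Map the HTTP status to a CI exit code."""
    if status == 422:
        return 1  # regression: fail the deploy
    if 200 <= status < 300:
        return 0  # no regression
    return 2      # hypothetical: anything else is treated as a gate error


def gate(app_id: str, baseline: str, candidate: str, api_key: str) -> int:
    request = urllib.request.Request(
        build_compare_url(app_id, baseline, candidate),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return exit_code_for_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, so the 422 regression signal arrives here.
        return exit_code_for_status(err.code)
```

A CI step would call `sys.exit(gate(...))` with the key read from a secret, so a 422 stops the pipeline the same way the CLI's exit 1 does.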
Answer quality: the opt-in judge
Run health and the regression gate are deterministic and answer whether the run was healthy. To also ask whether the answer was good, turn on the LLM judge. It is off by default; a stock deployment makes no judge call and the gate stays reproducible.
- Auditable by design. Every judgement is written as a signed quality_eval receipt: the model, a per-dimension breakdown (correctness, grounding, completeness, instruction following), quoted evidence, and a confidence. The evaluator is on the same tamper-evident record as everything else, so an auditor can check what the judge did and that it was not altered.
- Not a vibe number. The overall score is the worst dimension, the judge cites evidence, and it abstains when it cannot tell. Abstained runs are excluded from the metric rather than guessed.
- Same gate. The judge writes the scores; the regression engine reads them as a quality metric, so "did the deploy make the answers worse?" runs through the same comparison and CI gate as everything else, and the gate itself stays deterministic.
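The aggregation rules in the list above can be sketched as follows. The 0 to 1 score scale, the use of None for an abstained run, and the mean as the release-level metric are assumptions for this example; only "overall is the worst dimension" and "abstentions are excluded" come from the text.

```python
from statistics import mean
from typing import Optional

DIMENSIONS = ("correctness", "grounding", "completeness", "instruction_following")


def overall_score(judgement: Optional[dict]) -> Optional[float]:
    """Worst dimension wins; None stands for a run where the judge abstained."""
    if judgement is None:
        return None
    return min(judgement[dimension] for dimension in DIMENSIONS)


def quality_metric(judgements: list) -> Optional[float]:
    """Mean overall score across judged runs, skipping abstentions entirely."""
    scored = [overall_score(j) for j in judgements if j is not None]
    return mean(scored) if scored else None
```

Taking the minimum means one weak dimension cannot be averaged away by three strong ones, and dropping abstentions keeps an "I cannot tell" from being counted as either a pass or a fail.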
# enable on the server, then score a release's runs (writes signed quality_eval receipts)
export PROVA_QUALITY_JUDGE=1 ANTHROPIC_API_KEY=sk-...
curl -X POST -H "Authorization: Bearer prv_..." \
"https://prova.cobound.dev/api/v1/eval/judge?app_id=claims-agent&release=$GIT_SHA"
# then 'quality' appears as a metric in /api/v1/eval/compare and the dashboard
Outcomes: the ground truth
Health is a heuristic and the judge is a proxy. The strongest signal is what a human or downstream system actually did with the output. Record it and acceptance becomes a first-class, auditable regression metric: a signed outcome receipt references the run, so "a reviewer accepted this and here is the correction" is on the tamper-evident record (the kind of evidence the EU AI Act's human-oversight requirements ask for).
// thumbs-down from a user, or a reviewer override, on a specific run
await prova.feedback(runId, 'corrected', {
  source: 'reviewer',
  correctedOutput: 'The deductible is $1,500, not $1,000.',
});
curl -X POST -H "Authorization: Bearer prv_..." -H "content-type: application/json" \
https://prova.cobound.dev/api/v1/feedback \
-d '{"run_id":"run-42","verdict":"accepted"}'
The regression view then reads acceptance_rate and override_rate between releases, so "did the deploy lower acceptance?" runs through the same gate. Because the judge score and the human outcome land on the same runs, the judge can be calibrated against ground truth over time.
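As a rough illustration of how the two rates could be derived from recorded verdicts, assuming the denominator is the number of runs with an outcome and that an override corresponds to the 'corrected' verdict (both are guesses; the exact definitions are not given here):

```python
from collections import Counter
from typing import Optional


def outcome_rates(verdicts: list[str]) -> Optional[dict]:
    """Share of recorded outcomes that were accepted vs. corrected."""
    total = len(verdicts)
    if total == 0:
        return None  # no outcomes recorded for this release
    counts = Counter(verdicts)
    return {
        "acceptance_rate": counts["accepted"] / total,
        # Assumption: a 'corrected' verdict counts as an override.
        "override_rate": counts["corrected"] / total,
    }
```

Computing the rates per release gives the regression view two numbers to compare, the same way it compares health or flag rate.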
Matched-input A/B (probe sets, pairwise)
Comparing production runs is directional but noisy, because each release sees different inputs. For a clean A/B, run a fixed set of probe inputs against each release and tag them with a stable probe_id (alongside PROVA_RELEASE). Prova pairs the runs by probe id and asks the judge, per pair, which output is better. Pairwise preference is far more reliable than absolute scoring, and the verdict is an exact sign test over the wins and losses, so a small curated set still gives a trustworthy answer.
# after running your probe set on both releases (tagged with probe_id + release)
curl -H "Authorization: Bearer prv_..." \
"https://prova.cobound.dev/api/v1/eval/pairwise?app_id=claims-agent&baseline=v36&candidate=v37"
# -> { matchedProbes, wins, losses, ties, pValue, verdict } (422 on a regression)
Ties are dropped from the test, failed judgements shrink the sample rather than skew it, and below a minimum number of decisive pairs the verdict is insufficient_data rather than a guess.
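A minimal sketch of a two-sided exact sign test over wins and losses. It assumes wins count pairs where the candidate was preferred, a hypothetical minimum of 6 decisive pairs, and a 0.05 significance level; apart from insufficient_data, the verdict strings are placeholders rather than the endpoint's actual values.

```python
from math import comb

MIN_DECISIVE_PAIRS = 6  # hypothetical threshold
ALPHA = 0.05            # hypothetical significance level


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact binomial test of wins vs. losses at p = 0.5."""
    n = wins + losses
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def pairwise_verdict(wins: int, losses: int, ties: int = 0) -> dict:
    decisive = wins + losses  # ties carry no preference and are dropped
    if decisive < MIN_DECISIVE_PAIRS:
        return {"verdict": "insufficient_data", "pValue": None}
    p_value = sign_test_p_value(wins, losses)
    if p_value >= ALPHA:
        verdict = "no_change"
    elif losses > wins:
        verdict = "regression"
    else:
        verdict = "improvement"
    return {"verdict": verdict, "pValue": p_value}
```

With 1 win and 8 losses the p-value is 20/512, about 0.039, so even nine decisive probes can support a verdict; with 5 and 5 the test reports no change.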