Fleet benchmark

Did this model regress for everyone?

Every receipt is a signed AI decision. Aggregated across contributing teams, the corpus shows how each model performs in production: health, flag and loop rates, cost, and whether it regressed this week. The data network effect compounds as more teams contribute.

Contribute to see

The benchmark is reciprocal. Opt in at /dashboard/fleet to contribute your anonymized model metrics, and you can read the aggregate. An org that does not contribute cannot read it (the API returns 403).
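
A client can treat the 403 as a distinct "not opted in" condition instead of a generic failure. The sketch below is one way to do that against the report endpoint, GET /api/v1/fleet/benchmark. The bearer-token header, the base URL, and the names read_benchmark and NotContributing are assumptions for illustration; this page does not specify the auth scheme or the response body.

```python
import json
import urllib.error
import urllib.request

BENCHMARK_PATH = "/api/v1/fleet/benchmark"


class NotContributing(Exception):
    """Raised on a 403: the org has not opted in to the benchmark."""


def read_benchmark(base_url, token, opener=urllib.request.urlopen):
    """Fetch the aggregate report as parsed JSON.

    `opener` is injectable so the function can be exercised without a network.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + BENCHMARK_PATH,
        # Assumed auth scheme; substitute whatever your deployment uses.
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with opener(request) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 403:
            raise NotContributing(
                "opt in at /dashboard/fleet to read the aggregate"
            ) from err
        raise  # any other HTTP error is passed through unchanged
```

Callers can catch NotContributing and point the user at /dashboard/fleet, while other HTTP errors still surface as ordinary failures.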

What is shared (and what never is)

Shared, per model, as a mean across organizations:

The model id (provider/name) and run-health score.
Flag rate, loop rate, needs-human rate, and cost per run.
A window-over-window trend (this week vs last).

Never shared, never leaves your tenant:

Your org_id, app_id, project, or release tags.
Any prompt, output, or payload. Any end-user or tenant identifier.

How the privacy holds

Mean of org-means. Each org contributes one number per model; the published value is the average across orgs, so no org's raw figures are recoverable and a large contributor cannot dominate.
k-anonymity. A model is shown only once at least k organizations (default 3) contribute data for it. Below that threshold it is suppressed, and the report shows only how many models were suppressed.
Tiny-contributor suppression. An org needs a minimum number of runs on a model to count toward it, so a handful of calls can never be singled out.
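
The three rules above compose into a short aggregation step. The following is a minimal sketch, not Prova's implementation: the row shape, the function name aggregate, and the MIN_RUNS value of 50 are assumptions (the actual minimum run count is not stated here). It assumes one row per org and model.

```python
from collections import defaultdict
from statistics import mean

K_MIN_ORGS = 3   # default k described above
MIN_RUNS = 50    # hypothetical threshold, chosen only for illustration


def aggregate(rows, k=K_MIN_ORGS, min_runs=MIN_RUNS):
    """Aggregate one metric across orgs.

    rows: iterable of (org, model, runs, org_mean), one row per org and model.
    Returns (published, suppressed_count), where `published` maps each model
    to the mean of the qualifying org-means.
    """
    per_model = defaultdict(list)
    for _org, model, runs, org_mean in rows:
        bucket = per_model[model]      # register the model even if filtered
        if runs >= min_runs:           # tiny-contributor suppression
            bucket.append(org_mean)

    published = {}
    suppressed = 0
    for model, org_means in per_model.items():
        if len(org_means) >= k:        # k-anonymity threshold
            # Every org counts once, regardless of how many runs it made.
            published[model] = mean(org_means)
        else:
            suppressed += 1
    return published, suppressed


rows = [
    ("org-a", "vendor/model-a", 1_000, 0.10),
    ("org-b", "vendor/model-a", 200, 0.20),
    ("org-c", "vendor/model-a", 100_000, 0.30),
    ("org-a", "vendor/model-b", 500, 0.05),
    ("org-b", "vendor/model-b", 10, 0.50),
]
published, suppressed = aggregate(rows)
```

In this example org-c has 100 times the traffic of org-a but the same weight in the result, and vendor/model-b is counted as suppressed because only one org clears the run minimum.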

This is k-anonymity, not differential privacy: we are precise about the guarantee. As contributor counts grow, calibrated noise (a formal DP guarantee) is the natural next step.

The regression signal

The benchmark is keyed by model name over time. When a vendor silently updates a model behind the same name, its fleet-wide health moves, and the week-over-week trend flips to regressed across every contributor at once. That is the “this model regressed for everyone after the vendor update” signal you cannot get from your own traffic alone.
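
As an illustration of the week-over-week comparison, the sketch below labels each model from its last two weekly health values. The tolerance of 0.02, the assumption that a higher health score is better, and the labels "improved" and "stable" are hypothetical; only "regressed" is named on this page.

```python
def classify_trend(last_week, this_week, tolerance=0.02):
    """Label the change in fleet-wide health between two windows.

    Assumes higher health is better; `tolerance` is an illustrative dead band
    so that small fluctuations are not reported as a change.
    """
    delta = this_week - last_week
    if delta < -tolerance:
        return "regressed"
    if delta > tolerance:
        return "improved"
    return "stable"


def fleet_trends(history, tolerance=0.02):
    """history: {model_id: [health per weekly window, oldest first]}.

    Models with fewer than two windows are skipped, since there is
    nothing to compare against.
    """
    return {
        model: classify_trend(scores[-2], scores[-1], tolerance)
        for model, scores in history.items()
        if len(scores) >= 2
    }
```

Because the history is keyed by model name, a silent vendor update shows up as a drop under the same key, with no new model id to explain it.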

Signed report

GET /api/v1/fleet/benchmark returns the report with an Ed25519 signature, so it is a tamper-evident artifact: a vendor cannot dispute that a regression came from Prova's corpus, and an auditor can verify the signature with the public key without trusting Prova's servers.

Read it in the dashboard.