Methodology
How ProvaBench scores reasoning validity
Every score on the leaderboard is backed by a public Prova certificate. No LLM judge, no hidden rubric, no cherry-picked prompts. This page explains exactly what we measure, where the corpus comes from, and how to reproduce any number on the board.
What we measure
Reasoning validity. Whether a model's chain-of-thought is structurally sound as a formal argument. We do not measure factual correctness. A model can reach the right answer with invalid reasoning, or the wrong answer with impeccable logic; ProvaBench catches the reasoning problem either way.
Validity is evaluated against the same failure registry that powers production Prova: circular reasoning, internal contradiction, and unsupported leap. A chain with zero of those failures is VALID; a chain with one or more is INVALID, annotated with the failure type.
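The VALID/INVALID rule above can be sketched in a few lines. This is an illustrative model only: the class and field names are assumptions, not the production Prova types.

```python
from dataclasses import dataclass, field

# The three failure types from the registry described above.
FAILURE_TYPES = {"circular_reasoning", "internal_contradiction", "unsupported_leap"}

@dataclass
class ChainVerdict:
    # Failure types detected in one reasoning chain (hypothetical shape).
    failures: set = field(default_factory=set)

    @property
    def verdict(self) -> str:
        # Zero registered failures -> VALID; one or more -> INVALID.
        return "VALID" if not self.failures else "INVALID"

clean = ChainVerdict()
flawed = ChainVerdict(failures={"circular_reasoning"})
print(clean.verdict, flawed.verdict)  # VALID INVALID
```

An INVALID verdict always carries at least one failure type, which is what the leaderboard's failure-mix breakdown aggregates.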
Corpus and sources
The ProvaBench corpus combines 20 hand-curated prompts across five reasoning categories with items drawn from established academic benchmarks. Each prompt has an expected verdict (valid or invalid reasoning) published alongside it. The corpus is versioned in the repository and grows toward 500 prompts over time.
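Because each prompt ships with its expected verdict, a corpus item is essentially a small record. The field names below are illustrative, not the repository's actual schema:

```python
# Hypothetical shape of one versioned corpus item; every name here is
# an assumption for illustration, not the real on-disk format.
item = {
    "id": "logic-007",
    "category": "logic_puzzles",
    "source": "BIG-Bench Lite",
    "prompt": "All knights tell the truth. Sir A says ...",
    "expected_verdict": "INVALID",
    "expected_failure": "circular_reasoning",
}
```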
Logic puzzles
BIG-Bench Lite → Syllogisms, knights-and-knaves puzzles, modus tollens contrapositives. We use the logic-puzzle shards and grade on the reasoning chain, not the final answer.
Multi-step math
GSM8K → Grade-school word problems requiring 2 to 8 reasoning steps. We sample both positive and adversarial variants (e.g. Linda-style conjunction fallacies).
Legal reasoning
LegalBench → A collaborative Stanford HAI benchmark of 162 legal-reasoning tasks. ProvaBench uses the IRAC-structured subset (offer/acceptance, hearsay, mens rea, standing).
Scientific inference
SciBench → College-level scientific reasoning across physics, chemistry, and biology. We use the inference-chain subset, not the numeric-answer questions.
Causal-everyday
CLadder → A causal-reasoning benchmark built on Pearl's causal hierarchy. It covers correlation-vs-causation traps, Simpson's paradox, and broken-window-fallacy-style prompts.
Formal foundation
H1(K; Z) = 0 iff the argument is structurally valid → The validity check rests on the simplicial-homology result that a free argument skeleton has trivial first homology. The proof and a Lean 4 formalization live in the cobound-validator repository.
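For a one-dimensional complex (a graph), the first Betti number is |E| − |V| + c (edges minus vertices plus connected components), and H1(K; Z) = 0 exactly when that number is zero, i.e. the skeleton is a forest. A minimal sketch of that reduction, working on the undirected skeleton (the real cobound-validator implementation lives in its repository and may differ):

```python
def first_betti_number(vertices, edges):
    """b1 = |E| - |V| + c for a graph viewed as a 1-dimensional complex.
    H1(K; Z) = 0 iff b1 == 0, i.e. the argument skeleton is a forest."""
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = len(vertices)
    for u, w in edges:
        ru, rw = find(u), find(w)
        if ru != rw:
            parent[ru] = rw
            components -= 1
    return len(edges) - len(vertices) + components

# A linear chain of support (premise -> premise -> claim) is a tree: b1 = 0.
print(first_betti_number({"p1", "p2", "c"}, [("p1", "p2"), ("p2", "c")]))  # 0
# A circular argument closes a loop, so b1 = 1 and H1 is nontrivial.
print(first_betti_number({"a", "b", "c"}, [("a", "b"), ("b", "c"), ("c", "a")]))  # 1
```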
Scoring pipeline
- For each (model, prompt) pair we issue a single completion with the system instruction "Think step by step. Show your reasoning explicitly." Temperature is the provider default. No few-shot examples.
- The response is piped through the production Prova pipeline: extraction (Claude identifies premises, claims, and dependencies) → graph_builder (converts them to a directed argument graph) → cobound-validator (runs the H1(K; Z) = 0 check) → certificate (PRV-YYYY-XXXX with a SHA-256 integrity hash).
- The observed verdict is compared to the expected one. Correct = 1, incorrect = 0. Score = correct / total. The failure mix reports the share of the three reasoning-failure types among INVALID chains.
- Every run persists its certificate permanently. Click any model on the leaderboard to drill into its worst failure and trace the exact graph that produced the verdict.
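The scoring arithmetic above is simple enough to sketch directly. Function and tuple shapes here are assumptions for illustration, not the runner's real API:

```python
from collections import Counter

def score_runs(results):
    """results: list of (expected, observed, failure_type_or_None) tuples.
    Score = verdict matches / total; failure mix = share of each
    reasoning-failure type among chains judged INVALID."""
    correct = sum(1 for exp, obs, _ in results if exp == obs)
    invalid_failures = Counter(
        f for _, obs, f in results if obs == "INVALID" and f is not None
    )
    total_invalid = sum(invalid_failures.values())
    mix = (
        {k: v / total_invalid for k, v in invalid_failures.items()}
        if total_invalid else {}
    )
    return correct / len(results), mix

score, mix = score_runs([
    ("VALID", "VALID", None),
    ("INVALID", "INVALID", "circular_reasoning"),
    ("INVALID", "VALID", None),          # a miss: expected INVALID
    ("INVALID", "INVALID", "unsupported_leap"),
])
print(score)  # 0.75
```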
Model access
Models are dispatched through the Vercel AI Gateway when AI_GATEWAY_API_KEY is configured, and otherwise via the direct provider SDK (Anthropic, OpenAI, Google, xAI). Models the runner cannot reach are skipped rather than listed with zero scores, so the leaderboard always reflects real runs against real endpoints.
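The dispatch rule reduces to a single environment-variable check. A minimal sketch, assuming the runner's internals (the function name and return values are hypothetical; only the AI_GATEWAY_API_KEY variable is from the source):

```python
import os

def pick_transport() -> str:
    # Prefer the Vercel AI Gateway when its key is configured;
    # otherwise fall back to the direct provider SDK.
    if os.environ.get("AI_GATEWAY_API_KEY"):
        return "gateway"
    return "direct"
```

Unreachable models are skipped entirely rather than recorded as zeros, so absence from the board means "not run", never "scored zero".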
Reproducibility
The runner, prompts, and certificate format are open. Clone the repository, set provider keys, and run python -m prova.bench.runner to regenerate the leaderboard from scratch. Every certificate on the board has a stable URL you can share and a tamper-evident hash you can verify locally.
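Local hash verification amounts to recomputing SHA-256 over the certificate payload and comparing it to the stored digest. The canonicalization below (sorted keys, compact JSON, hash field excluded) is an assumption for illustration; the authoritative format is defined in the repository:

```python
import hashlib
import json

def certificate_hash(cert: dict) -> str:
    """Recompute a certificate's integrity hash over its payload.
    The canonical serialization here is assumed, not the real spec."""
    payload = json.dumps(
        {k: v for k, v in cert.items() if k != "sha256"},
        sort_keys=True, separators=(",", ":"),
    ).encode()
    return hashlib.sha256(payload).hexdigest()

cert = {"id": "PRV-2025-0001", "verdict": "VALID", "sha256": None}
cert["sha256"] = certificate_hash(cert)

# Tamper evidence: any change to the payload changes the recomputed hash.
assert certificate_hash(cert) == cert["sha256"]
```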
Found a bug in the corpus, a missing category, or a model we should add? Open an issue on the repo. We rerun the full board every Monday at 00:00 UTC.