ProvaBench

How often do frontier models actually reason validly?

Every week we run Claude, GPT, Gemini, Llama, and Mistral against a fixed corpus of reasoning prompts and score every chain-of-thought with a formal Prova certificate. No LLM judge. No hidden rubric. Every result is backed by a public certificate you can inspect.

Latest run: 2026-04-20

Models covered: 2 (scored in the latest run)
Prompts evaluated: 40 (aggregated across all models)
Top score: 0.0% of chains verified valid

Rank · Model · Score · n
1 · anthropic/claude-opus-4-7 · 0.0% · 20
2 · anthropic/claude-sonnet-4-6 · 0.0% · 20

Failure mix categories: circular · contradiction · unsupported leap

Not another LLM leaderboard

MMLU, GPQA, and most public benchmarks score whether the final answer is correct. A model can reach the right answer with circular reasoning, unsupported leaps, or internal contradictions. ProvaBench scores the reasoning itself, not the guess at the end.
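To make those three failure modes concrete, here is a toy sketch of what checking a reasoning chain for them could look like. This is not Prova's certificate format; every field name and the checking logic below are hypothetical, chosen only to illustrate the categories (circular, contradiction, unsupported leap).

```python
# Toy checker for the three failure modes ProvaBench reports.
# Hypothetical representation: a chain is a list of steps, each with an
# 'id', a 'claim', the ids it relies on ('supports'), and a 'premise' flag.

def check_chain(steps):
    """Return the sorted list of failure labels found in a chain."""
    failures = set()
    by_id = {s["id"]: s for s in steps}

    # Unsupported leap: a non-premise step that cites no support at all.
    for s in steps:
        if not s.get("premise") and not s.get("supports"):
            failures.add("unsupported leap")

    # Contradiction: a claim and its negation both asserted in the chain.
    claims = {s["claim"] for s in steps}
    for c in claims:
        if c.startswith("not ") and c[4:] in claims:
            failures.add("contradiction")

    # Circular: a step transitively supports itself (cycle in the
    # support graph), detected with a depth-first search.
    def has_cycle(node, visiting, done):
        if node in done:
            return False
        if node in visiting:
            return True
        visiting.add(node)
        for dep in by_id.get(node, {}).get("supports", []):
            if has_cycle(dep, visiting, done):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    for s in steps:
        if has_cycle(s["id"], set(), set()):
            failures.add("circular")
            break

    return sorted(failures)


chain = [
    {"id": 1, "claim": "A", "premise": True, "supports": []},
    {"id": 2, "claim": "B", "supports": [3]},
    {"id": 3, "claim": "not B", "supports": [2]},
]
print(check_chain(chain))  # ['circular', 'contradiction']
```

The point of the sketch is that an answer-only benchmark would never run a check like this: a chain can fail all three tests and still land on the right final answer.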

Runtime guardrails such as Guardrails AI, Lakera, and NeMo Guardrails enforce rule-based policies at inference time; they are not reasoning verifiers. ProvaBench scores every chain with formal Prova certificates you can inspect yourself.

Full methodology · all results are backed by public Prova certificates. Click any model to drill into its worst failure.