Why I built Prova
A multi-agent system silently went in circles for three days while the team running it paid the bill. That's the problem Prova catches.
A few months ago a team I was advising had a multi-agent research pipeline running in production. Six agents -- an orchestrator, a data fetcher, an analysis layer, a QA reviewer, a report writer, and a coordinator. The system was supposed to summarize quarterly revenue across APAC markets. It had been "working" for three days.
It wasn't working. It was looping.
The orchestrator kept telling the data agent to pull fresh data. The data agent pulled the same dataset under a slightly different name. The analysis agent flagged a methodology issue. The QA agent blocked progress on methodology. The orchestrator interpreted "blocked on methodology" as a request to expand the scope, and told the data agent to pull more data. The data agent pulled the same dataset under another slightly different name.
Nothing was wrong with any individual agent. Each one made a locally reasonable decision. The failure was at the coordination level -- the same handoff pattern repeated until someone noticed the bill.
That's the failure mode every team building multi-agent systems eventually hits. Not a single agent hallucinating. Not a tool call timing out. A coordination loop where the system as a whole is doing work but going nowhere.
Why this is hard to catch
LangSmith, Langfuse, and the rest of the observability tools will show you every span. You can scroll through and see every individual call. You will not notice the loop. Loops aren't visible at the span level -- they're visible at the graph level. You have to look at how state flows between agents over many steps.
We tried building a script in-house that scanned execution logs for repeated patterns. It mostly worked. It also produced enough false positives that nobody read its alerts. Repeating a similar prompt isn't necessarily a loop. Repeating it in a closed cycle of agents who can't make progress is.
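To make that distinction concrete, here is a minimal sketch. It is not the in-house script and not Prova's detector; the function name `find_closed_cycle`, the `(sender, receiver)` handoff format, and the `min_repeats` threshold are all assumptions made up for illustration. It only reports a pattern when the most recent handoffs repeat AND the pattern chains back to the agent it started from.

```python
from typing import List, Optional, Tuple

Handoff = Tuple[str, str]  # (sender, receiver) -- hypothetical log format


def find_closed_cycle(
    handoffs: List[Handoff], min_repeats: int = 3
) -> Optional[List[Handoff]]:
    """Return the repeating handoff pattern if the log ends in a closed cycle.

    A pattern qualifies only if (a) the tail of the log is that pattern
    repeated `min_repeats` times and (b) the pattern is a closed chain:
    each receiver is the next sender, and the last receiver is the first sender.
    """
    n = len(handoffs)
    for period in range(2, n // min_repeats + 1):
        tail = handoffs[n - period * min_repeats:]
        pattern = tail[:period]
        repeats = all(tail[i] == pattern[i % period] for i in range(len(tail)))
        chained = all(
            pattern[i][1] == pattern[i + 1][0] for i in range(period - 1)
        )
        closed = chained and pattern[-1][1] == pattern[0][0]
        if repeats and closed:
            return pattern
    return None


# The loop from the story: orchestrator -> data -> analysis -> qa -> orchestrator.
loop = [
    ("orchestrator", "data"),
    ("data", "analysis"),
    ("analysis", "qa"),
    ("qa", "orchestrator"),
]
looping_log = [("user", "orchestrator")] + loop * 3

# A plain retry: the same handoff repeated, but no cycle of agents.
retry_log = [("orchestrator", "data")] * 6

looping_result = find_closed_cycle(looping_log)
retry_result = find_closed_cycle(retry_log)
```

The retry log repeats just as much as the looping one, but it never closes back on itself, so it is not reported. A check like this still says nothing about whether progress is being made, which is the part that needs more than pattern matching.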
The math we ended up using
It turns out there's a well-studied mathematical tool for "is this graph stuck in a cycle that won't resolve" -- it's called persistent homology. You build a simplicial complex from the agent communication graph, you watch how its first homology group evolves over time, and you flag any non-trivial 1-cycle that stays alive across enough steps. That's the loop. The math is precise about when something is a real cycle versus transient noise.
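For a rough feel of the idea, here is a deliberately simplified stand-in, not full persistent homology and not Prova's implementation. It treats each time window of handoffs as an undirected graph, computes the first Betti number with the standard graph formula b1 = E - V + C (edges minus vertices plus connected components), and reports the first window of a run where b1 stays above zero. The names `first_betti_number`, `persistent_loop_start`, and `min_windows` are assumptions for the example.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Edge = Tuple[str, str]  # (sender, receiver), treated as undirected here


def first_betti_number(edges: Iterable[Edge]) -> int:
    """Count independent cycles in an undirected graph: b1 = E - V + C."""
    # Deduplicate and drop self-loops; direction is ignored in this sketch.
    undirected = {frozenset(e) for e in edges if e[0] != e[1]}
    parent: Dict[str, str] = {}
    for edge in undirected:
        for vertex in edge:
            parent.setdefault(vertex, vertex)

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = len(parent)
    for edge in undirected:
        a, b = tuple(edge)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1
    return len(undirected) - len(parent) + components


def persistent_loop_start(
    windows: List[List[Edge]], min_windows: int = 3
) -> Optional[int]:
    """Return the index of the first window in a run of `min_windows`
    consecutive windows that each contain at least one cycle, else None."""
    run_start: Optional[int] = None
    for i, window in enumerate(windows):
        if first_betti_number(window) > 0:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_windows:
                return run_start
        else:
            run_start = None
    return None


cycle = [
    ("orchestrator", "data"),
    ("data", "analysis"),
    ("analysis", "qa"),
    ("qa", "orchestrator"),
]
chain = cycle[:3]  # same agents, but the path never closes

stuck_at = persistent_loop_start([chain, cycle, cycle, cycle])
transient = persistent_loop_start([cycle, chain, cycle, cycle])
```

Two caveats on this sketch. Real persistent homology tracks when each individual cycle is born and dies across a filtration, rather than just counting windows. And because direction is ignored, a two-agent ping-pong collapses to a single edge and is not counted as a cycle here.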
I'm not going to do the math here. The point isn't the math. The point is that what felt vague and hard to detect by inspection -- "is this thing stuck?" -- has a precise definition once you look at it the right way.
What Prova actually does
Prova watches the runtime state of your agent system. Every read and write from your agents flows through it. When a coordination loop forms -- the same handoff pattern persists past the point of plausible progress -- you get a Slack alert, a webhook, and a tamper-evident receipt: the exact agents stuck, the step the loop started, and a hash you can hand to an auditor.
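To show what "tamper-evident" can mean in practice, here is a hypothetical sketch of a receipt. It is not Prova's actual receipt format; the field names, `make_receipt`, and `verify_receipt` are invented for the example. The idea is only that the hash is computed over a canonical encoding of the receipt's contents, so any later change to the agents or the start step no longer matches.

```python
import hashlib
import json
from typing import Any, Dict, List


def _digest(body: Dict[str, Any]) -> str:
    # Canonical JSON (sorted keys, no extra whitespace) so the same content
    # always hashes to the same value.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_receipt(
    agents: List[str], start_step: int, prev_hash: str = "0" * 64
) -> Dict[str, Any]:
    """Build a receipt whose hash covers its contents and the previous hash."""
    body = {
        "agents": sorted(agents),
        "start_step": start_step,
        "prev_hash": prev_hash,  # chaining makes reordering detectable too
    }
    return {**body, "hash": _digest(body)}


def verify_receipt(receipt: Dict[str, Any]) -> bool:
    """Recompute the hash from the receipt's fields and compare."""
    body = {k: v for k, v in receipt.items() if k != "hash"}
    return _digest(body) == receipt["hash"]


receipt = make_receipt(["orchestrator", "data", "analysis", "qa"], start_step=42)
tampered = dict(receipt, start_step=43)  # edit a field, keep the old hash

receipt_ok = verify_receipt(receipt)
tampered_ok = verify_receipt(tampered)
```

A bare hash like this only proves anything if the hash itself is stored somewhere the editor of the record can't rewrite it; a production system would typically sign the receipt or anchor the hash externally.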
That's it. One line to install. No infrastructure changes. No new database. No new vendor lock-in.
If you're building anything with LangGraph, CrewAI, or your own agent runtime -- if you've ever spent an afternoon staring at a trace trying to figure out why your agents won't terminate -- I'd love to show you what Prova catches on your system. Book a call, or try the live demo.