PolybrainBench.
An open public benchmark from Polylogic AI that measures how AI systems disagree. We ask the same question to nine different AIs at once — ChatGPT, Grok, Llama, Kimi, Qwen, and others — and publish exactly where their answers diverge.
This is the research. The same team ships Polybrain, our agent platform, built on the same verification primitives.
We asked 9 AI systems the same question: “Which planet has the coldest atmosphere in our solar system?”
The fleet split between two answers: Neptune and Uranus.
10,452 questions asked · 11 AI systems · 7 AI companies · Free to read, cite, and build on
A living benchmark for how AI models disagree.
Most benchmarks measure whether a single AI can answer a question correctly. PolybrainBench measures something different: when you ask the same question to nine different AIs, where do they disagree, and how big is the gap?
Every verification cycle dispatches one claim to nine independently trained models in parallel. Every response is captured with its full text, its timing, and a cryptographic provenance stamp. No model sees any other model's output during the cycle. No reviewer grades the cycle while it's in flight. The pattern of who takes which position is itself the data the benchmark publishes.
Every N cycles the daemon regenerates a paper from the current ledger and validates it against a separate six-model reviewer fleet. Two of those six are external anchors from provider families absent from the generator (Anthropic and Google), so their scores are fully independent. The other four are drawn from the generator to preserve the character of its strictest and most generous voices. Eleven unique models across seven independent training lineages touch the pipeline end to end. The paper regenerates and republishes itself.
The paper is free to read, cite, and build on. 10,452 verification cycles in the published dataset, growing every day. Hosted on Zenodo with a permanent identifier; the full ledger is on Hugging Face.
When AIs disagree, that’s where uncertainty actually lives.
You already experience this. You ask Claude and GPT the same factual question and get different answers. You check Wikipedia against what an AI told you and find quiet discrepancies. Two well-aligned models can deliver contradictory confident statements about the same claim, and neither one will tell you which is right.
That gap is the measurement. And until now, no public benchmark was capturing it systematically.
Existing LLM benchmarks all take the same basic approach: test one model at a time against a fixed answer key and publish an accuracy percentage. That measures something real, but it doesn't capture what actually happens in practice, which is that you use several AIs at once and they keep giving you different answers to the same questions.
PolybrainBench captures the full response from nine independently trained AIs for every claim, across four separate commercial providers. When all nine agree, the claim is probably settled knowledge. When they split, the claim is in a zone where no single AI should be trusted without external evidence. The disagreement pattern is the signal no single model can give you, and the signal no accuracy benchmark can give you either.
If you publish research
A public corpus of cross-model disagreement.
Nine full responses per claim, four providers, CC-BY-4.0. Cite the DOI. Download the JSONL. Train on it, analyze it, disagree with it.
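One way to use the JSONL download is to tally how often the fleet splits. The sketch below is illustrative only: the field names (`claim`, `position`) are assumptions, and the real ledger schema on Hugging Face may differ.

```javascript
// Toy disagreement tally over ledger records. Field names are assumed
// for illustration; check the published JSONL schema for the real ones.
function disagreementRate(records) {
  const byClaim = new Map();
  for (const { claim, position } of records) {
    if (!byClaim.has(claim)) byClaim.set(claim, new Map());
    const counts = byClaim.get(claim);
    counts.set(position, (counts.get(position) || 0) + 1);
  }
  let split = 0;
  for (const counts of byClaim.values()) {
    // More than one distinct position on a claim counts as disagreement.
    if (counts.size > 1) split++;
  }
  return split / byClaim.size;
}
```

For example, a ledger with one unanimous claim and one split claim yields a rate of 0.5.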
If you build with AI
A reliability signal for when to doubt.
Check the canonical page for a claim before you act on a model’s answer. If nine AIs already disagreed on it, you probably should too.
If you’re just reading
A window into what AIs don’t actually know.
Browse any of the 7,004 claim pages and watch nine models try to agree. The ones where they split are the interesting ones.
The same claim. Nine models. In parallel. No answer key.
Every verification cycle is one claim, dispatched in parallel to all nine fleet models. The full response text from each model is captured and stamped with a SHA-256 provenance hash. Per-model response time is recorded in milliseconds. No model sees any other model’s output during the cycle. No reviewer grades the cycle while it’s in flight.
There is no hidden answer key. The benchmark does not assume any particular position on any claim is correct. It measures the pattern of which models take which position, and publishes the disagreement as the data.
Every N cycles the daemon regenerates the paper from the current ledger and validates the paper against a six-model reviewer fleet. Two of the six are external anchors from provider families absent from the generator fleet: claude-sonnet-4-5 from Anthropic and gemini-2.5-pro from Google. They have no corpus contribution at all, so their scores are fully independent of the training lineages that produced the data. The other four reviewers are drawn from the generator fleet to preserve its strictest and most generous voices, and they grade the paper’s aggregate analysis of the full ledger rather than their own isolated per-claim outputs.
The living paper is always the current canonical artifact. No threshold gates publication. The composite score is a property of the published paper, not a precondition for it. The Matthew Effect is the only gate.
Growing by about two thousand cycles a day.
At paper v16 the corpus is a snapshot of a living artifact, not a fixed release. The measurement is the full response text from every model on every claim, not a summary or an accuracy score. The corpus is designed to grow: every 6 hours a scheduled daemon adds about five hundred new cycles, targeted at claim shapes where models are most likely to disagree.
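The growth rate quoted above follows directly from the schedule:

```javascript
// Back-of-envelope for the corpus growth rate: the daemon runs every
// 6 hours and adds roughly 500 cycles per run.
const runsPerDay = 24 / 6;                      // 4 cron runs a day
const cyclesPerRun = 500;                       // approximate batch size
const cyclesPerDay = runsPerDay * cyclesPerRun; // ≈ 2,000 cycles/day
```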
10,452 · verification cycles · at paper v16
94,068 · full model responses · text + timing + provenance
~2,000 · cycles per day · autonomous cron
72 · honest composite · Q 75 · A 67
Topic selection is adversarial by design. The topic generator rejects consensus trivia and targets claim shapes where models are most likely to diverge: non-round specific numbers, contested historical dates, cross-field technical definitions, near-training-cutoff events, named standards with effective dates, and common misconceptions stated flatly in their corrected form. “Paris is the capital of France” is forbidden; “The SI redefinition of the kilogram took effect on May 20, 2019, based on a fixed Planck constant of 6.62607015 × 10⁻³⁴ joule-seconds” is the shape we generate.
Real token cost is captured from the API response usage fields. A validator run on paper v16 cost approximately $0.040 in real API tokens across the six-model disjoint reviewer fleet ($0.026 Claude Sonnet 4.5 + $0.011 Gemini 2.5 Pro + $0.003 across the in-fleet OpenAI and xAI reviewers; Groq reviewers are free-tier). The dataset is not just citable, it's auditable: you can check what every measurement cost to produce.
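The audit itself is just a sum over the per-reviewer costs reported in the usage fields. The figures below are the v16 validator run quoted above:

```javascript
// Cost audit for the v16 validator run: sum real per-reviewer API costs
// captured from usage fields. Groq reviewers are free-tier, so $0.
const reviewerCostsUsd = {
  "claude-sonnet-4-5": 0.026,
  "gemini-2.5-pro": 0.011,
  "in-fleet OpenAI + xAI": 0.003,
};
const totalUsd = Object.values(reviewerCostsUsd).reduce((a, b) => a + b, 0);
// totalUsd ≈ 0.040
```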
Eleven models. Seven training lineages. Five API providers.
Two separate fleets do two separate jobs. The generator fleet of nine models dispatches against every claim in the corpus. The reviewer fleet of six models validates the paper about that corpus. Four reviewers are drawn from the generator to preserve its strictest and most generous voices; they grade the paper’s aggregate analysis of the full ledger rather than their own isolated per-claim responses. Two reviewers are external anchors — claude-sonnet-4-5 from Anthropic and gemini-2.5-pro from Google — from provider families absent from the generator, so they have no corpus contribution at all and their scores are fully independent of every training lineage that produced the data. Together: eleven unique models across seven independent training lineages, running on five live API billing relationships. Replicating the fleet from scratch means accounts, keys, quota, and monitoring with all five.
Per-model Q and A shown: generators reflect the historical v8 self-reviewed reading (preserved for comparison). External reviewers reflect the first Sprint 7 disjoint reading (v13). Composite = round(0.6 × mean(Q) + 0.4 × mean(A)) = 72 on v16.
Three operations, applied recursively.
Dispatch
Parallel to nine.
A single claim is dispatched to all nine fleet models simultaneously. Per-model response text, timing, and provenance hashes are captured. Grounding verification confirms the atomic transaction is written cleanly. One cycle per topic.
Measure
Quality and adversarial.
Every N cycles the paper is regenerated from the ledger and scored by the six-model reviewer fleet: four in-fleet reviewers plus two external anchors from Anthropic and Google. Each reviewer reports Q (quality) and A (adversarial) on a 0–100 scale. Composite = round(0.6 × mean(Q) + 0.4 × mean(A)). Real token cost is captured from the API response usage fields.
Publish
Always. No threshold.
Every validated paper publishes. The composite is displayed prominently in the paper’s own header blockquote, but it does not gate publication. A new Zenodo DOI is minted per version; the concept DOI always resolves to the latest. The Matthew Effect is the only gate.
Cite it, download it, browse it.
@dataset{salvo_polybrainbench_2026,
  author    = {Salvo, Andy},
  title     = {PolybrainBench v8: A Living Benchmark for
               Cross-Model Consensus Verification of
               Natural-Language Claims},
  year      = 2026,
  publisher = {Zenodo},
  version   = {v8},
  doi       = {10.5281/zenodo.19546460},
  url       = {https://doi.org/10.5281/zenodo.19546460}
}

Author: Andy Salvo · ORCID 0009-0008-8629-8827 · Polylogic AI · Penn State Smeal.
Concept DOI (cross-version): 10.5281/zenodo.19546459, which resolves to the latest paper on every regeneration.
We also run a live verification API.
The same methodology, exposed as a pay-per-call endpoint for AI agents on the x402 economy. Free trust-profile reads. $0.01 per verification paid automatically via x402 on Base. Backed by the live PolybrainBench fleet.
// Free trust-profile read
const trust = await fetch(
  "https://trust.polylogicai.com/profile/0x..."
);
const { score } = await trust.json();
if (score < 50) skip(service);

// Paid verification ($0.01 via x402)
const verified = await fetch(
  "https://trust.polylogicai.com/verify",
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ claim: aiOutput })
  }
);

trust.polylogicai.com · Live endpoint
$0.01 USDC · Per verification
Base · Network
x402 · Payment protocol