PolybrainBench.
An open public benchmark from Polylogic AI that measures how AI systems disagree. We ask the same question to nine different AIs at once — ChatGPT, Grok, Llama, Kimi, Qwen, and others — and publish exactly where their answers diverge.
This is the research. The same team ships Polybrain, our agent platform, built on the same verification primitives.
We asked 9 AI systems the same question: “Which planet has the coldest atmosphere in our solar system?”
The fleet split between two answers: Neptune and Uranus.
10,452 questions asked · 11 AI systems · 7 AI companies · Free to read, cite, and build on
A living benchmark for how AI models disagree.
Most benchmarks measure whether a single AI can answer a question correctly. PolybrainBench measures something different: when you ask the same question to nine different AIs, where do they disagree, and how big is the gap?
Every verification cycle dispatches one claim to nine independently trained models in parallel. Every response is captured with its full text, its timing, and a cryptographic provenance stamp. No model sees any other model's output during the cycle. No reviewer grades the cycle while it's in flight. The pattern of who takes which position is itself the data the benchmark publishes.
Every N cycles the daemon regenerates a paper from the current ledger and validates it against a separate six-model reviewer fleet. Two of those six are external anchors from provider families absent from the generator (Anthropic and Google), so their scores are fully independent. The other four are drawn from the generator to preserve the character of its strictest and most generous voices. Eleven unique models across seven independent training lineages touch the pipeline end to end. The paper regenerates and republishes itself.
The paper is free to read, cite, and build on. 10,452 verification cycles in the published dataset, growing every day. Hosted on Zenodo with a permanent identifier; the full ledger is on Hugging Face.
When AIs disagree, that’s where uncertainty actually lives.
You already experience this. You ask Claude and GPT the same factual question and get different answers. You check Wikipedia against what an AI told you and find quiet discrepancies. Two well-aligned models can deliver contradictory confident statements about the same claim, and neither one will tell you which is right.
That gap is the measurement. And until now, no public benchmark was capturing it systematically.
Existing LLM benchmarks all take the same basic approach: test one model at a time against a fixed answer key and publish an accuracy percentage. That measures something real, but it doesn't capture what actually happens in practice, which is that you use several AIs at once and they keep giving you different answers to the same questions.
PolybrainBench captures the full response from nine independently trained AIs for every claim, across four separate commercial providers. When all nine agree, the claim is probably settled knowledge. When they split, the claim is in a zone where no single AI should be trusted without external evidence. The disagreement pattern is the signal no single model can give you, and the signal no accuracy benchmark can give you either.
If you publish research
A public corpus of cross-model disagreement.
Nine full responses per claim, four providers, CC-BY-4.0. Cite the DOI. Download the JSONL. Train on it, analyze it, disagree with it.
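One way to use the JSONL download is to tally how often the fleet splits. The sketch below is illustrative only: the field names (`claim`, `position`) are assumptions, and the real ledger schema on Hugging Face may differ.

```javascript
// Toy disagreement tally over ledger records. Field names are assumed
// for illustration; check the published JSONL schema for the real ones.
function disagreementRate(records) {
  const byClaim = new Map();
  for (const { claim, position } of records) {
    if (!byClaim.has(claim)) byClaim.set(claim, new Map());
    const counts = byClaim.get(claim);
    counts.set(position, (counts.get(position) || 0) + 1);
  }
  let split = 0;
  for (const counts of byClaim.values()) {
    // More than one distinct position on a claim counts as disagreement.
    if (counts.size > 1) split++;
  }
  return split / byClaim.size;
}
```

For example, a ledger with one unanimous claim and one split claim yields a rate of 0.5.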
If you build with AI
A reliability signal for when to doubt.
Check the canonical page for a claim before you act on a model’s answer. If nine AIs already disagreed on it, you probably should too.
If you’re just reading
A window into what AIs don’t actually know.
Browse any of the 7,004 claim pages and watch nine models try to agree. The ones where they split are the interesting ones.
The same claim. Nine models. In parallel. No answer key.
Every verification cycle is one claim, dispatched in parallel to all nine fleet models. The full response text from each model is captured and stamped with a SHA-256 provenance hash. Per-model response time is recorded in milliseconds. No model sees any other model’s output during the cycle. No reviewer grades the cycle while it’s in flight.
There is no hidden answer key. The benchmark does not assume any particular position on any claim is correct. It measures the pattern of which models take which position, and publishes the disagreement as the data.
Every N cycles the daemon regenerates the paper from the current ledger and validates the paper against a six-model reviewer fleet. Two of the six are external anchors from provider families absent from the generator fleet: claude-sonnet-4-5 from Anthropic and gemini-2.5-pro from Google. They have no corpus contribution at all, so their scores are fully independent of the training lineages that produced the data. The other four reviewers are drawn from the generator fleet to preserve its strictest and most generous voices, and they grade the paper’s aggregate analysis of the full ledger rather than their own isolated per-claim outputs.
The living paper is always the current canonical artifact. No threshold gates publication. The composite score is a property of the published paper, not a precondition for it. The Matthew Effect is the only gate.
Growing by about two thousand cycles a day.
At paper v16 the corpus is a snapshot of a living artifact, not a fixed release. The measurement is the full response text from every model on every claim, not a summary or an accuracy score. The corpus is designed to grow: every 6 hours a scheduled daemon adds about five hundred new cycles, targeted at claim shapes where models are most likely to disagree.
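The growth rate quoted above follows directly from the schedule:

```javascript
// Back-of-envelope for the corpus growth rate: the daemon runs every
// 6 hours and adds roughly 500 cycles per run.
const runsPerDay = 24 / 6;                      // 4 cron runs a day
const cyclesPerRun = 500;                       // approximate batch size
const cyclesPerDay = runsPerDay * cyclesPerRun; // ≈ 2,000 cycles/day
```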
10,452 · verification cycles · at paper v16
94,068 · full model responses · text + timing + provenance
~2,000 · cycles per day · autonomous cron
72 · honest composite · Q 75 · A 67
Topic selection is adversarial by design. The topic generator rejects consensus trivia and targets claim shapes where models are most likely to diverge: non-round specific numbers, contested historical dates, cross-field technical definitions, near-training-cutoff events, named standards with effective dates, and common misconceptions stated flatly in their corrected form. “Paris is the capital of France” is forbidden; “The SI redefinition of the kilogram took effect on May 20, 2019, based on a fixed Planck constant of 6.62607015 × 10⁻³⁴ joule-seconds” is the shape we generate.
Real token cost is captured from the API response usage fields. A validator run on paper v16 cost approximately $0.040 in real API tokens across the six-model disjoint reviewer fleet ($0.026 Claude Sonnet 4.5 + $0.011 Gemini 2.5 Pro + $0.003 across the in-fleet OpenAI and xAI reviewers; Groq reviewers are free-tier). The dataset is not just citable, it's auditable: you can check what every measurement cost to produce.
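The audit itself is just a sum over the per-reviewer costs reported in the usage fields. The figures below are the v16 validator run quoted above:

```javascript
// Cost audit for the v16 validator run: sum real per-reviewer API costs
// captured from usage fields. Groq reviewers are free-tier, so $0.
const reviewerCostsUsd = {
  "claude-sonnet-4-5": 0.026,
  "gemini-2.5-pro": 0.011,
  "in-fleet OpenAI + xAI": 0.003,
};
const totalUsd = Object.values(reviewerCostsUsd).reduce((a, b) => a + b, 0);
// totalUsd ≈ 0.040
```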
Eleven models. Seven training lineages. Five API providers.
Two separate fleets do two separate jobs. The generator fleet of nine models dispatches against every claim in the corpus. The reviewer fleet of six models validates the paper about that corpus. Four reviewers are drawn from the generator to preserve its strictest and most generous voices; they grade the paper’s aggregate analysis of the full ledger rather than their own isolated per-claim responses. Two reviewers are external anchors — claude-sonnet-4-5 from Anthropic and gemini-2.5-pro from Google — from provider families absent from the generator, so they have no corpus contribution at all and their scores are fully independent of every training lineage that produced the data. Together: eleven unique models across seven independent training lineages, running on five live API billing relationships. Replicating the fleet from scratch means accounts, keys, quota, and monitoring with all five.
Per-model Q and A shown: generators reflect the historical v8 self-reviewed reading (preserved for comparison). External reviewers reflect the first Sprint 7 disjoint reading (v13). Composite = round(0.6 × mean(Q) + 0.4 × mean(A)) = 72 on v16.
Three operations, applied recursively.
Dispatch
Parallel to nine.
A single claim is dispatched to all nine fleet models simultaneously. Per-model response text, timing, and provenance hashes are captured. Grounding verification confirms the atomic transaction is written cleanly. One cycle per topic.
Measure
Quality and adversarial.
Every N cycles the paper is regenerated from the ledger and scored by the six-model reviewer fleet: four in-fleet reviewers plus two external anchors from Anthropic and Google. Each reviewer reports Q (quality) and A (adversarial) on a 0–100 scale. Composite = round(0.6 × mean(Q) + 0.4 × mean(A)). Real token cost is captured from the API response usage fields.
Publish
Always. No threshold.
Every validated paper publishes. The composite is displayed prominently in the paper’s own header blockquote, but it does not gate publication. A new Zenodo DOI is minted per version; the concept DOI always resolves to the latest. The Matthew Effect is the only gate.
Cite it, download it, browse it.
@dataset{salvo_polybrainbench_2026,
  author    = {Salvo, Andy},
  title     = {PolybrainBench v8: A Living Benchmark for
               Cross-Model Consensus Verification of
               Natural-Language Claims},
  year      = 2026,
  publisher = {Zenodo},
  version   = {v8},
  doi       = {10.5281/zenodo.19546460},
  url       = {https://doi.org/10.5281/zenodo.19546460}
}

Author: Andy Salvo · ORCID 0009-0008-8629-8827 · Polylogic AI · Penn State Smeal.
Concept DOI (cross-version): 10.5281/zenodo.19546459, which resolves to the latest paper on every regeneration.
We also run a live verification API.
The same methodology, exposed as a pay-per-call endpoint for AI agents on the x402 economy. Free trust-profile reads. $0.01 per verification paid automatically via x402 on Base. Backed by the live PolybrainBench fleet.
// Free trust-profile read
const trust = await fetch(
  "https://trust.polylogicai.com/profile/0x..."
);
const { score } = await trust.json();
if (score < 50) skip(service);

// Paid verification ($0.01 via x402)
const verified = await fetch(
  "https://trust.polylogicai.com/verify",
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ claim: aiOutput })
  }
);

trust.polylogicai.com · Live endpoint
$0.01 USDC · Per verification
Base · Network
x402 · Payment protocol