Most AI systems have no memory of their own track record. They pass or fail each run in isolation, with no mechanism to distinguish a system that has been reliable for six months from one deployed yesterday. Trust scores solve this by treating validation history the way credit bureaus treat financial history: recent performance matters most, critical failures leave lasting marks, and trust without fresh evidence decays. The result is a single number that compounds over time and gates real operational decisions.
The Problem: No Memory
Run an AI validation pipeline today and it tells you whether the output passed or failed. Run it again tomorrow and you get another binary verdict. Neither run knows the other exists. There is no accumulation, no trajectory, no distinction between a system with fifty consecutive clean runs and one that failed catastrophically last week.
This is the state of most AI quality assurance. Each evaluation is an isolated event. A passing score on Tuesday carries no weight on Wednesday. A critical failure on Monday is forgotten by Friday. The system cannot answer the most basic question a decision-maker would ask: is this getting better or worse?
For AI systems that produce client-facing deliverables, this gap has real consequences. A deployment pipeline that treats every run identically cannot distinguish between a high-confidence deploy and a lucky pass on a volatile system. It cannot slow down when things are trending poorly. It cannot grant autonomy to systems that have earned it. Without history, there is no basis for graduated trust.
How Existing Systems Build Trust
The problem of quantifying trustworthiness from behavioral history has been solved repeatedly outside of AI. Each solution shares a common structure: observe behavior over time, weight recent evidence more heavily, penalize severe failures disproportionately, and tie the resulting score to real consequences.
| System | What It Scores | Key Mechanism | What It Gates |
|---|---|---|---|
| FICO | Creditworthiness | Recency bias, severity scaling, 7-year lookback | Access to capital |
| PageRank | Page authority | Dampening factor (0.85) prevents single-link dominance | Search ranking position |
| eBay Seller Ratings | Transaction reliability | Cumulative feedback percentage, visible star tiers | Buyer confidence, search placement |
| Stack Overflow | Expertise | Reputation from votes, bounties, accepted answers | Editing, moderation, close privileges |
The pattern is consistent. FICO does not treat a bankruptcy the same as a late payment. PageRank does not let a single inbound link determine authority. Stack Overflow does not grant moderation privileges to a new account on the strength of one good answer. In each case, trust is earned through sustained performance and lost quickly through severe failure. The score has consequences, which is what makes it meaningful rather than decorative.
Exponential Decay for AI Validation
The right mathematical tool for this is the Exponentially Weighted Moving Average. EWMA gives recent observations exponentially more influence than older ones without discarding history entirely. Each new validation run contributes a fixed percentage of the score update. The remaining weight comes from the accumulated past.
With a smoothing factor of 0.15, each new run contributes 15% of the updated score, and the accumulated past contributes the remaining 85%. A run's influence then shrinks by a factor of 0.85 per cycle: after twenty cycles it retains roughly 4% of its original weight, and after fifty cycles less than 0.05%. The score naturally forgets old performance without requiring explicit expiration rules or fixed window sizes.
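The update rule itself is a one-liner. A minimal sketch in Python, with illustrative function and parameter names (the document specifies only the smoothing factor, 0.15):

```python
def ewma_update(score: float, run_result: float, alpha: float = 0.15) -> float:
    """Blend a new validation result into the trust score.

    The new run contributes alpha (15%) of the updated score;
    the accumulated history contributes the remaining 85%.
    """
    return alpha * run_result + (1 - alpha) * score

# A run's influence k cycles later has decayed by (1 - alpha) ** k:
# 0.85 ** 20 is about 3.9%; 0.85 ** 50 is under 0.05%.
```

No window of past runs needs to be stored: the single accumulated score carries the entire weighted history.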
But time decay alone is not enough. Critical failures need event penalties on top of the EWMA update. A catastrophic validation failure should drop the score by a fixed amount that can only be recovered through subsequent clean runs. This mirrors FICO's treatment of severe delinquencies: a bankruptcy damages a credit score far more than a late payment, and recovery requires sustained good behavior, not one good month.
The asymmetry is deliberate. Dropping fifteen points takes one bad run. Recovering fifteen points takes eight to twelve clean runs. This is what separates a trust score from a vanity metric. A vanity metric goes up and never comes down. A trust score has teeth.
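Combining the two mechanisms is straightforward. A sketch under stated assumptions: the 15-point penalty and the 0-to-100 score range are illustrative, as is the convention that a critical failure both lowers the run result and triggers the flat penalty:

```python
CRITICAL_PENALTY = 15.0  # illustrative fixed drop for a catastrophic failure

def apply_run(score: float, run_result: float, critical_failure: bool,
              alpha: float = 0.15) -> float:
    """EWMA update plus a flat event penalty for critical failures."""
    score = alpha * run_result + (1 - alpha) * score
    if critical_failure:
        score -= CRITICAL_PENALTY
    return max(0.0, min(100.0, score))  # clamp to the 0-100 range
```

With these numbers, a system at 85 that fails critically (run result 30) lands near 62 in one step, and clean runs scoring 95 need about eight updates to climb back above 85: damage is instant, recovery is earned.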
Detecting Regressions, Not Just Failures
Static thresholds catch absolute failures. A score below 60 is bad regardless of context. But static thresholds miss contextual regressions. A system that consistently scores 88-92 dropping to 72 should trigger investigation even though 72 clears any static floor. The regression itself is the signal, not the absolute number.
The EWMA baseline enables this. By tracking each system's own historical mean and standard deviation, deviations become relative to that system's established behavior. A z-score measures how many standard deviations a new result falls from the system's own baseline. Two standard deviations below the mean flags an anomaly. The same absolute score might be normal for one system and alarming for another.
This is how industrial quality control has worked for decades. Shewhart control charts, introduced in the 1920s at Bell Labs, established the principle that variation within a process should be measured against that process's own behavior, not against a universal standard. EWMA control charts refined this by weighting recent observations more heavily, making them sensitive to small, sustained shifts that fixed-window methods miss. The same statistical machinery applies directly to AI validation scores.
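A minimal sketch of the regression check, assuming a stored list of each system's recent scores; the threshold of two standard deviations comes from the text, while the function name and minimum-history guard are illustrative:

```python
from statistics import mean, stdev

def is_regression(history: list[float], new_score: float,
                  z_threshold: float = -2.0) -> bool:
    """Flag a score more than two standard deviations below
    the system's own historical baseline."""
    if len(history) < 2:
        return False  # too little history to estimate variance
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return new_score < mu  # any drop from a perfectly flat baseline
    return (new_score - mu) / sigma < z_threshold
```

For a system whose history hovers in the 88-92 band, a new score of 72 sits more than ten standard deviations below its own mean and is flagged, even though 72 clears any static floor.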
What Trust Scores Should Gate
A trust score that does not gate anything is a dashboard widget nobody checks. The score becomes meaningful only when it has operational consequences.
The minimum viable integration: a system with a trust score below a threshold requires manual approval before deployment, even if its most recent run passed. A passing run on a low-trust system is not the same as a passing run on a high-trust system. The trust score encodes the difference.
| Trust Range | Operational Consequence |
|---|---|
| Above 75 | Auto-deploy permitted. System has earned autonomy through sustained performance. |
| 50 to 75 | Manual review required before deploy. Passing runs are not sufficient without trust history. |
| Below 50 | Artifacts quarantined. Alert triggered. No deployment until trust is rebuilt through validated runs. |
Staleness matters as much as the number itself. A score of 85 with a last-validated timestamp of three days ago means something different than 85 validated forty-seven days ago. Trust without fresh evidence should decay. Two points per week of inactivity is a reasonable starting position. This prevents a system that ran clean six months ago from carrying a healthy score when nobody has validated it since.
Bootstrapping Trust: The Cold Start Problem
A new system has no validation history. The trust score must handle this without giving unearned confidence or penalizing the system for being new.
The solution mirrors how credit scoring handles thin files. New systems start at a neutral score, neither high nor alarming. The score is marked provisional until a minimum number of validation runs exist. During the provisional period, each run carries double the normal EWMA weight, so the score converges to reality faster. After sufficient runs accumulate, the score switches to standard weighting and the provisional label drops.
The provisional label is not cosmetic. It communicates uncertainty that a bare number does not. “Trust: 72 (provisional, 4 runs)” tells a fundamentally different story than “Trust: 72 (stable, 47 runs).” Both are 72. Only one has earned it.
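The bootstrap logic is a small variation on the standard update. A sketch under stated assumptions: the neutral starting score of 60 and the ten-run provisional window are illustrative choices, not values from the text; only the doubled weight during the provisional period is specified above:

```python
NEUTRAL_START = 60.0   # illustrative neutral starting score
PROVISIONAL_RUNS = 10  # illustrative run count before the label drops

def bootstrap_update(score: float, run_result: float, run_count: int,
                     alpha: float = 0.15) -> float:
    """EWMA update with doubled weight while the score is provisional."""
    effective_alpha = 2 * alpha if run_count < PROVISIONAL_RUNS else alpha
    return effective_alpha * run_result + (1 - effective_alpha) * score

def trust_label(score: float, run_count: int) -> str:
    """Render the score with its provisional or stable qualifier."""
    status = "provisional" if run_count < PROVISIONAL_RUNS else "stable"
    return f"Trust: {score:.0f} ({status}, {run_count} runs)"
```

Doubling the weight means a new system's score converges toward its true performance in roughly half as many runs, after which the standard smoothing factor takes over.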
This is the architectural choice that separates a meaningful system from a reporting artifact. Every design decision follows from whether the score has consequences: whether it can go down, whether recovery is harder than damage, whether staleness is visible, and whether the number gates something real. A trust score without teeth is a number on a dashboard. A trust score with teeth is a governance mechanism.
Sources
- Roberts, S. W. (1959). “Control Chart Tests Based on Geometric Moving Averages.” Technometrics, 1(3), 239-250. Original EWMA paper establishing exponential smoothing for quality control.
- Lucas, J. M. & Saccucci, M. S. (1990). “Exponentially Weighted Moving Average Control Schemes: Properties and Enhancements.” Technometrics, 32(1), 1-12. EWMA parameter selection and average run length analysis.
- Montgomery, D. C. (2019). Introduction to Statistical Quality Control, 8th ed. Wiley. Standard reference for EWMA control chart theory and lambda selection.
- Fair Isaac Corporation. (2024). What's in Your FICO Score. FICO scoring methodology: five weighted categories with recency bias and severity scaling.
- Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab. Dampening factor and iterative trust propagation.