Evaluation Cards: Quickstart

A stakeholder-agnostic guide to getting started. ~6 min.

Heads up: Evaluation Cards is in Beta. We'd love your feedback: report bugs, request features, or tell us what's confusing through our feedback form or the public roadmap. You can reach the feedback form from any page via Feedback in the top navigation.

The homepage hero and the Corpus snapshot: the current totals for models, results, organizations, and benchmarks.

What Evaluation Cards is

Evaluation Cards is a reporting layer over AI model evaluations, built by the EvalEval Coalition. It collects how AI models are evaluated across many benchmarks and reporting organizations, and also shows what was left undocumented.

A benchmark score on its own ("Model X scores 87% on MMLU") tells you very little. Evaluation Cards puts each score in context: who ran the evaluation, how it was set up, whether it can be reproduced, and whether it can be fairly compared to another score. The project treats every published evaluation as a claim, and every undisclosed detail as a claim deliberately not made. Neither is treated as an error.

At a glance (snapshot of June 2026), the corpus tracks:


5,816 models	101,955 reported results
31 reporting organizations	820 model developers
57 benchmark families	632 single benchmarks

The corpus is versioned by snapshot, and every page shows a snapshot date. The numbers above will drift as the corpus grows, so always cite the snapshot you saw.

The four interpretive signals

Every record is assessed against four signals. Learn to read them and most of the site falls into place.

The "Interpretive signals" section on the homepage, with the four cards: Reproducibility, Completeness, Provenance, Comparability.

Signal	Question it answers	What a low score means
R · Reproducibility	Could a third party re-run this evaluation?	Prompts, decoding settings, harness version, seeds, or code are undisclosed.
C · Completeness	Does the record meet normal reporting expectations for this kind of model?	Whole categories (e.g. safety, robustness, fairness) may be missing.
P · Provenance & Risk	Who ran it, and what real-world property does it measure?	Distinguishes first-party (the developer) from third-party (independent) evaluators.
X · Comparability	Can two scores on the same benchmark be put side by side?	Different splits, metric variants, or units make a direct ranking invalid.

A high benchmark score with weak signals is still a weak claim. The signals are how you tell a well-documented result from one you can't check.
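As a rough illustration of the Comparability row above, the check can be sketched as "two results are only put side by side when their split, metric variant, and unit all match". The `Result` fields and the `comparable` function are hypothetical names chosen for this sketch. They are not the site's schema, and the actual signal may weigh more factors than these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    # Hypothetical record shape, for illustration only.
    benchmark: str
    split: str
    metric: str
    unit: str
    score: float


def comparable(a: Result, b: Result) -> bool:
    """True only when both results share benchmark, split, metric, and unit."""
    key_a = (a.benchmark, a.split, a.metric, a.unit)
    key_b = (b.benchmark, b.split, b.metric, b.unit)
    return key_a == key_b


# Made-up scores, not taken from the corpus.
a = Result("MMLU", "test", "accuracy", "percent", 87.0)
b = Result("MMLU", "test", "accuracy", "percent", 84.5)
c = Result("MMLU", "dev", "accuracy", "percent", 90.1)

print(comparable(a, b))  # True
print(comparable(a, c))  # False: different split
```

Under this reading, `c` scoring higher than `a` says nothing about ranking, because the two numbers were measured on different splits.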

The five-level hierarchy

Scores resolve through an explicit pathway, so any headline number can be drilled down to the evidence behind it:

Family  →  Composite  →  Single Benchmark  →  Split  →  Metric

For example: MMLU (family) → MMLU-Pro (composite) → a single benchmark → a split (e.g. a particular subject or language subset) → accuracy (metric). When you see an aggregate claim, you can always click down to the specific metric supporting it.
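One way to picture the pathway above is as a five-field record, one field per level. The sketch below is only a mental model: `MetricPath` and its field names are assumptions made for illustration, and the placeholder strings stand in for levels the example above does not name.

```python
from typing import NamedTuple


class MetricPath(NamedTuple):
    # One field per level of the hierarchy, top to bottom.
    family: str
    composite: str
    benchmark: str
    split: str
    metric: str

    def breadcrumb(self) -> str:
        """Render the path in the same arrow notation the guide uses."""
        return "  →  ".join(self)


# The guide's example leaves the single benchmark and split unnamed,
# so they stay as placeholders here.
path = MetricPath("MMLU", "MMLU-Pro", "<single benchmark>", "<split>", "accuracy")
print(path.breadcrumb())
```

Every headline number corresponds to one such path, which is what makes "click down to the specific metric" possible.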

Getting around: the top navigation

The top navigation bar: Overview · Models · Evaluations · Help · About · Feedback.

Overview (/): the corpus snapshot, the signals explained, and featured benchmark families. Start here.
Models (/models): every indexed model. Filter by parameter size, switch between Models and Developers views, and select up to four models to compare.
Evaluations (/evals): benchmarks grouped into families, filterable by interaction style (agent / non-agent) and ~17 categories (Mathematics, Safety, Software Engineering, …).
Help (/help): guides like this one, plus technical documentation on the signals and how to contribute data.
About (/about): how the signals are computed and the principles behind the corpus.
Feedback (/feedback): share feedback, request a feature, or report an issue.

Your first 5 minutes

Open a model page. Go to Models and click any model (e.g. Claude Opus 4.7). This page is the model's evaluation record. Behind each number sits more than the page shows: the benchmark's own metadata and the evaluation run details, which are what actually make the score interpretable.

A full model page in Summary View, e.g. /models/anthropic/claude-opus-4.7.

Read the DOCUMENTED badge. Near the top, a percentage (e.g. "36%, 14 / 39 reported") tells you how many of the model's reported scores have their prompting and run setup documented well enough to re-run. The rest are missing details like temperature, max tokens, or the harness. (This is the model-level Reproducibility read; it's not about completed benchmark cards.) A low number here is common, and that's deliberate.
Check "Who reports what" (§3). A donut and per-category bars split results into first-party (the developer's own numbers) vs third-party (independent). This is your fastest read on how independent the evidence is.

The §3 "Who reports what" section of a model page: the first-party vs third-party breakdown.
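The two reads above, the DOCUMENTED badge and the first-party vs third-party split, both come down to simple counts. The sketch below reproduces the badge example from the text (14 of 39 is about 36%). The function names, the rounding rule, and the evaluator names are assumptions for illustration, not the site's implementation.

```python
from collections import Counter


def documented_badge(documented: int, reported: int) -> str:
    """Format badge text from two counts (the rounding rule is an assumption)."""
    if reported == 0:
        return "no reported scores"
    pct = round(100 * documented / reported)
    return f"{pct}%, {documented} / {reported} reported"


def party_split(evaluators: list[str], developer: str) -> Counter:
    """Count results run by the developer itself vs by anyone else."""
    return Counter(
        "first-party" if name == developer else "third-party"
        for name in evaluators
    )


print(documented_badge(14, 39))  # 36%, 14 / 39 reported

# Hypothetical evaluator names for one model's results.
split = party_split(["DevCo", "DevCo", "Lab A", "Lab B"], developer="DevCo")
print(split["first-party"], split["third-party"])  # 2 2
```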

Understand the Summary score. In Summary View each benchmark is shown leaderboard-style, ranked against peer models. When a model has more than one reported result for a benchmark, the score shown is the median of all reported results.

The §4 Reported metrics list in Summary View: benchmarks ranked best→worst with the model's score and peer rank.
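To see why the median matters for the Summary score, consider a benchmark with several reported results where one is far from the others. The numbers below are made up, and `summary_score` is a hypothetical name. How the site handles an even number of results is not stated in this guide; Python's `statistics.median` averages the two middle values in that case.

```python
from statistics import mean, median


def summary_score(reported: list[float]) -> float:
    """Median of all reported results for one model on one benchmark."""
    return median(reported)


# Hypothetical results from three sources; one is much lower than the rest.
reported = [60.0, 87.0, 88.0]

print(summary_score(reported))      # 87.0
print(round(mean(reported), 2))     # 78.33
print(summary_score([86.0, 88.0]))  # 87.0
```

The median stays near the two sources that agree, while the mean is pulled down by the outlier. The spread itself is what Researcher View exposes.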

Switch to Researcher View. The toggle in the bar at the top of the page swaps Summary for Researcher view, which exposes the underlying per-result detail. Use Summary to get oriented, and Researcher when you want to dig in.

The Summary / Researcher View toggle at the top of a model page. The snapshot date sits at the right of the same bar.

In Researcher View, §4 lists one row per benchmark with the number of results (N), the mean and 95% CI, and the range of reported scores. Expand a row to see each source's score, its generation settings, and the per-result flags showing which signals each result trips. That's the spread and provenance behind the summary number.

An Overlaps row expanded to show each source's score, settings, and per-result flags.
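The per-benchmark row described above (N, mean, 95% CI, range) can be approximated from a list of reported scores. This guide does not say how the site computes its interval, so the sketch assumes a normal approximation (mean ± 1.96 standard errors), which is only one of several reasonable choices and is rough for small N. `researcher_row` and the sample scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def researcher_row(scores: list[float]) -> dict:
    """Summarize one benchmark's reported scores: N, mean, 95% CI, range."""
    n = len(scores)
    m = mean(scores)
    if n < 2:
        ci = None  # a single result has no spread to estimate from
    else:
        # Assumed method: normal approximation with the sample std deviation.
        half_width = 1.96 * stdev(scores) / sqrt(n)
        ci = (m - half_width, m + half_width)
    return {"N": n, "mean": m, "ci95": ci, "range": (min(scores), max(scores))}


row = researcher_row([80.0, 85.0, 90.0])  # made-up scores from three sources
print(row)
```

A wide interval or range is a cue to expand the row and check which sources, settings, or flags account for the disagreement.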

Note the snapshot date. Every model and evaluation page shows a snapshot date in that same top bar (e.g. Snapshot · Jun 9, 2026), and the homepage shows it in the Corpus snapshot header. The corpus is versioned, so numbers change between snapshots. Cite the snapshot date with any figure so others can find exactly what you saw.

Browsing evaluations

The Evaluations tab is the benchmark-first view. Benchmarks are grouped into families and tagged by category; filter by interaction style (agent / non-agent) or by category to narrow the list.

The Evaluations index, with families and their categories and counts.

Open a family to drill into its structure: the single benchmarks beneath it, their splits, and the metrics reported on each, following the same five-level hierarchy top-down.

A benchmark family opened to show its benchmarks and splits.

Walk through one evaluation

Click a benchmark to open its own page. The At a glance card at the top summarizes what it measures, its main caveat, who it's intended for, and how to read it, with links to the source paper and dataset. Below sit the Benchmark card (schema, methodology, what it measures), Technical details (metric, completeness, comparability, splits), and a "Can these scores be compared directly?" panel.

A benchmark's evaluation page, showing the "At a glance" card: measures · caveat · intended for · sources.

Scroll to Reporting Comparison: every model with a reported result on this benchmark, ranked, with the score distribution, the evaluator, the source, and the date. It's the benchmark-centric mirror of the §4 metrics on a model page.

The Reporting Comparison leaderboard on a benchmark's evaluation page.

The chart above the leaderboard has two views, toggled at the top. Distribution shows how all the reported scores are spread across models; Frontier traces the best score over time, showing how the state of the art on this benchmark has advanced as newer models were released.

The Reporting Comparison chart in Frontier view: best score over time, with the Distribution / Frontier toggle.
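The Frontier view described above is, in effect, a running maximum: walk the results in date order and keep only the points that set a new best. The sketch below assumes higher scores are better and uses made-up dates and scores. `frontier` is a hypothetical helper, not the site's code.

```python
from datetime import date


def frontier(points: list[tuple[date, float]]) -> list[tuple[date, float]]:
    """Return the (date, score) points that set a new best, in date order."""
    best = float("-inf")
    record_setters = []
    for day, score in sorted(points):
        if score > best:  # assumes higher is better for this metric
            best = score
            record_setters.append((day, score))
    return record_setters


# Hypothetical results: release date and score on one benchmark.
points = [
    (date(2024, 1, 10), 70.0),
    (date(2024, 6, 1), 68.0),
    (date(2025, 2, 3), 81.5),
    (date(2025, 9, 20), 79.0),
]

for day, score in frontier(points):
    print(day.isoformat(), score)
# 2024-01-10 70.0
# 2025-02-03 81.5
```

The two results that did not beat the previous best drop out of the Frontier line, though they would still appear in the Distribution view.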

Three things to remember

A score is a claim, not a fact. Read the signals before trusting the number.
First-party ≠ third-party. Always check who produced a result.
Cite the snapshot. The corpus is versioned; numbers change.

Support this effort

Evaluation Cards is a community effort from the EvalEval Coalition, and it gets more useful the more people use, report to, and cite it. If it helps your work, please cite our paper and share it with colleagues. Every role has a concrete way to pitch in: model developers can report their evaluations, evaluation developers can upload their benchmarks, researchers and policymakers can cite and share the work, and anyone can flag a correction. See how to contribute for the details.

➡️ Next: pick the guide for your role: Evaluation researchers · Policymakers · General public · Journalists.