Evaluation Cards for Model Developers

How your model appears in the corpus, how to read what's documented vs. missing, and how to make your evaluation reporting stronger.

Prerequisite: the Quickstart (~6 min) covers the four signals, the five-level hierarchy, and the snapshot model.

Why you should care how your model is carded

Whatever you publish about your model's evaluations, Evaluation Cards re-presents it as a structured record alongside everyone else's, and also shows what you did not disclose. The project treats a published score as a claim and an undisclosed detail as a claim deliberately not made (not an error). That means your card is, in effect, a public read on how legible and verifiable your reporting is.

A high benchmark number with weak signals reads as a weak claim. Most of those gaps are cheap to close, and this guide is a checklist for doing so.

The homepage with the corpus snapshot and the four signals.

Step 1: Find your model (and your org)

Models view (/models): search for your model, filter by parameter range, and open its page.

Developers view: toggle Models → Developers on the same page to see your organization's footprint across all its models.

The /models page with the Developers toggle active.

Step 2: Read your page honestly

Open your model's page in Summary View, then switch to Researcher View to see the per-result detail others will scrutinize.

The same page after clicking Researcher View.

Three things to look at first:

The DOCUMENTED badge (e.g. "36%, 14 / 39 reported"). This is the headline read on your reporting hygiene: how much of your reported record is fully specified. Treat a low number as a backlog, not a verdict.

The page header with the DOCUMENTED badge.

§1 Identification. Confirm the basics are right: model name, developer, release date, modalities, system ID. Errors here propagate everywhere.

§2 Benchmark coverage. What categories are present, and where the split spread is. Gaps here are visible to everyone reading your page.

Step 3: Work the four signals as a reporting checklist

Each signal maps to concrete, mostly low-cost actions.

🔁 Reproducibility: make it re-runnable

Disclose, per result: setup variants, prompts, decoding parameters, harness name + version, random seeds, and code/artifacts. Corpus-wide only ~3% of scores have complete setup documentation, so even modest disclosure stands out.

✅ Publish the exact harness + version (e.g. the eval framework and commit).
✅ State decoding settings (temperature, top-p, max tokens) and seeds.
✅ Link runnable code or a config, not just a number.

📋 Completeness: cover the categories that matter

Completeness is judged relative to expectations for your model class. A strong capability record that is silent on safety, robustness, or fairness reads as incomplete.

✅ Report beyond the flattering capability benchmarks.
✅ If a category is intentionally out of scope, the absence will still show, so consider documenting why.

👤 Provenance & Risk: invite independent evaluation

Your page splits results into first-party (your own) vs third-party (independent), and maps results to IBM Risk Atlas risk domains. A category that is entirely first-party is a visible blind spot: impressive, but uncorroborated.

The §3 "Who reports what" first-party vs third-party breakdown.

✅ Encourage / enable third-party evaluation, especially for safety-relevant claims.
✅ Don't expect self-reported numbers alone to carry weight with careful readers.

⚖️ Comparability: report so your scores can be compared

The site flags when two scores on the same benchmark used different splits, metric variants, or units, which invalidates direct comparison.

The §4 Reported metrics section with benchmark charts.

✅ Use the standard splits and metric definitions for each benchmark.
✅ State units explicitly; don't silently switch metric variants.
✅ Use the Evaluations page to see how others report the same benchmark family.

A benchmark family detail page e.g. /evals?family=artificial-analysis.

Step 4: Corrections and snapshots

The corpus follows snapshot discipline: no retroactive edits, only versioned corrections. Practically:

A number you published is attributed to its source document and tied to a dated snapshot, so it won't be silently rewritten.
If something is genuinely wrong or missing, the path is a correction in a future snapshot, not an edit-in-place. Keep your own source documents stable and clearly dated so they can be cited cleanly.
Because nothing is imputed, the fastest way to improve your page is simply to report more, and report it precisely, and the corpus will reflect it on the next snapshot.

A pre-launch checklist

Before your next model announcement, check that each headline benchmark claim ships with:

Harness name + version, and a link to runnable code/config
Decoding settings and seeds
The exact split and metric (with units), matching how the benchmark is normally reported
Coverage beyond capability: at least some safety / robustness / fairness results
At least one path to independent (third-party) evaluation for key claims
A stable, dated source document for every number

Closing these is what moves your DOCUMENTED percentage, and turns marketing numbers into claims that hold up.

➡️ Related: Evaluation researchers (how researchers will scrutinize your page) · Journalists (how reporters will source your claims) · Quickstart.