Help

How to use Evaluation Cards.

New here? Start with the quickstart, then dive into a guide written for your role. You can replay the intro tour at any time.

Quickstart

Stakeholder-agnostic · ~6 min

Quickstart

The four signals, the five-level hierarchy, and your first five minutes on the site.

Tutorials by stakeholder

Each guide reads the same record through a different lens. Pick the one closest to how you'll use Evaluation Cards.

Researchers auditing or comparing evaluations

Evaluation researchers

Researcher View, a per-section card walkthrough, rigorous signal reading, and comparability caveats.

Governance & policy

Policymakers

What the evidence supports versus what it doesn't, independence checks, and responsible claims.

Teams publishing models

Model developers

How your model is carded, what's documented versus missing, and a checklist to raise your signals.

Everyone

General public

Plain-language: what a benchmark score means and how to read a card.

Reporters

Journalists

The three-minute fact-check, sourcing, citation, and overclaiming traps.

Documentation

Deeper, more technical references for contributing to and working with the data behind Evaluation Cards.

Data sources & structure

What Evaluation Cards is built on

The infrastructure behind the corpus: Auto-BenchmarkCards, Every Eval Ever, IBM Risk Atlas, and the five-level hierarchy.

Signal definitions

How the four signals are computed

The exact computation behind reproducibility, completeness, provenance, and comparability: fields, formulas, and corpus aggregation.

Contributing evaluation data

Cross-post EEE results to Hugging Face

Send your Every Eval Ever results to Hugging Face Community Evals: the YAML schema, the converter, and the backlink to the full EEE record.

Verification

Get a verified checkmark

Submit your data through your organisation's Hugging Face account so your results show up as verified. This is our call for apples-to-apples comparison.

External · EvalEval

Add results to Every Eval Ever

The Every Eval Ever contributor site explains how to add evaluation results to the datastore that powers Evaluation Cards.

Suggest missing documentation on our public roadmap and we'll make sure to add it!

How to contribute

Evaluation Cards is a living, community artifact — its coverage and usefulness grow as people report, upload, use, and cite it. Here's what helps most, depending on who you are.

Model developers

Report your model's results to Every Eval Ever so they show up here in context.
Already on EEE? Cross-post your results to Hugging Face so your scores appear on the model page with a backlink.
Document the run-level details that raise your signals — temperature and max tokens, the harness, and (for agentic evaluations) the eval plan and limits.
See a wrong or missing number for your model? Flag it in the Space discussions or via each record's correction path.
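
For the run-level details mentioned above, a pre-submission check might look like the following minimal Python sketch. The field names (`temperature`, `max_tokens`, `harness`, `eval_plan`, `limits`) and the helper `missing_run_details` are hypothetical, chosen for illustration only. They are not the Every Eval Ever schema, so check the EEE documentation for the real field names.

```python
# Hypothetical field names for illustration; the real EEE schema may differ.
REQUIRED_RUN_FIELDS = ("temperature", "max_tokens", "harness")
AGENTIC_RUN_FIELDS = ("eval_plan", "limits")


def missing_run_details(record: dict, agentic: bool = False) -> list[str]:
    """Return the run-level fields that are absent or empty in a record."""
    wanted = REQUIRED_RUN_FIELDS + (AGENTIC_RUN_FIELDS if agentic else ())
    # A value of 0 or 0.0 counts as documented; only None or empty values are flagged.
    return [name for name in wanted if record.get(name) in (None, "", [], {})]


draft = {"temperature": 0.0, "max_tokens": 1024, "harness": ""}
print(missing_run_details(draft))                # ['harness']
print(missing_run_details(draft, agentic=True))  # ['harness', 'eval_plan', 'limits']
```

An empty list means every detail in this hypothetical checklist is filled in. It does not guarantee any particular signal score.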

Evaluation developers

Upload your benchmark's results to Every Eval Ever so others can find, run, and reuse them.
Fill in your benchmark's metadata — goals, construct, scoring rubric, intended uses, and limitations — to raise its completeness score.
Report schema gaps or data issues on the EEE issue tracker.

Researchers

Use Evaluation Cards in your model-, evaluation-, or field-level analysis — and cite the paper when you build on it.
Report third-party results you've run to Every Eval Ever — independent numbers are first-class here.
Flag discrepancies or suggest methodology improvements on the issue tracker or in the discussions.
Spread the word — share it with collaborators and on socials.

Policymakers

Consult Evaluation Cards as an evidence base — what's documented, who reported it, and how comparable it is.
Cite the paper in reports and briefings, and point colleagues to the site.
Tell us what evidence you need for decisions — suggest features on the public roadmap or via the feedback form.
Spread the word so more of the field reports legibly.

Spotted an error? Flag a wrong or missing number anywhere in the corpus through the feedback form, and include a source. Corrections are versioned, and coverage improves as developers and third parties publish.

Not sure where something fits? The public roadmap, the feedback form, the EEE issue tracker, and the Space discussions are always open.

How to cite

If you find this effort useful, please consider citing our paper and sharing our work on socials.

Reference

Ghosh, A., Reuel, A., Chim, J., Kennedy, W. M., et al. (2026). Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting. arXiv:2606.09809.

BibTeX · Evaluation Cards

@article{ghosh2026evaluationcards,
  title        = {Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting},
  author       = {Ghosh, Avijit and Reuel, Anka and Chim, Jenny and Kennedy, Wm. Matthew and Yadav, Srishti and Mickel, Jennifer and Long, Yanan and Tran, Andrew and Kornilova, Anastassia and Stachura, Damian and Klyman, Kevin and Friedrich, Felix and Sania, Jeba and Lamparth, Max and Batzner, Jan and Mishra, Anoop and Habba, Eliya and Hao, Yixiong and Heath, Nathan and Rismani, Shalaleh and Gohar, Usman and Loehr, Andrea and Manheim, David and Dhar, Ruchira and Nelaturu, Sree Harsha and Sinha, Aarush and Choshen, Leshem and Sharma, Drishti and Khire, Ishan and Saha, Amit and Sahoo, Subramanyam and Hardy, Michael and Riegler, Michael Alexander and Manghnani, Kabir and Lin, Michelle and Jiang, Yanan and Huang, Yilin and Yehudai, Asaf and Ji, Jessica and Hofmann, Aris and Akhtar, Mubashara and Moniz, Nuno and Jernite, Yacine and Biderman, Stella and Talat, Zeerak and Koyejo, Sanmi and Kochenderfer, Mykel and Solaiman, Irene},
  journal      = {arXiv preprint arXiv:2606.09809},
  year         = {2026},
  url          = {https://arxiv.org/abs/2606.09809}
}

Every Eval Ever (EEE) is a sister EvalEval project and one of the data sources that powers Evaluation Cards — please show it some love and cite it too. 💜

BibTeX · Every Eval Ever

@misc{evaleval2026everyevalever,
  title   = {Every Eval Ever: Toward a Common Language for AI Eval Reporting},
  author  = {Jan Batzner and Leshem Choshen and Avijit Ghosh and Sree Harsha Nelaturu and Anastassia Kornilova and Damian Stachura and Yifan Mai and Asaf Yehudai and Anka Reuel and Irene Solaiman and Stella Biderman},
  year    = {2026},
  month   = {February},
  url     = {https://evalevalai.com/infrastructure/2026/02/17/everyevalever-launch/},
  note    = {Blog Post, EvalEval Coalition}
}
