What Evaluation Cards is built on
Evaluation Cards doesn't run evaluations; it composes existing evaluation infrastructure into a single reading surface. Here's what powers it.
Auto-BenchmarkCards
A schema for benchmark-level metadata: what a benchmark measures, its splits, intended use, validity scope, and known limitations. Each benchmark family has an Auto-BenchmarkCard at the family root, plus a Policy Note that condenses the card into plain language.
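To make the shape of such a card concrete, here is a minimal sketch of a benchmark-level metadata record. It is not the actual Auto-BenchmarkCards schema: the class name, field names, and the `missing_fields` helper are assumptions chosen to mirror the fields listed above, and the example values are placeholders.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BenchmarkCard:
    """Hypothetical benchmark-level metadata record (illustrative field names)."""
    family: str
    measures: str
    splits: tuple[str, ...]
    intended_use: str
    validity_scope: str
    known_limitations: tuple[str, ...] = ()


def missing_fields(card: BenchmarkCard) -> list[str]:
    """Return the names of fields that were left empty on the card."""
    return [f.name for f in fields(card) if not getattr(card, f.name)]


# Placeholder values only; this does not describe a real benchmark.
card = BenchmarkCard(
    family="example-family",
    measures="short-answer reading comprehension",
    splits=("dev", "test"),
    intended_use="comparing models on held-out test data",
    validity_scope="English text only",
)
incomplete = missing_fields(card)  # known_limitations was never filled in
```

A check like `missing_fields` is one way a reading surface could flag cards whose limitations or validity scope were never documented.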
Every Eval Ever
A run-level corpus of public evaluation results: (model, benchmark, metric-path, value, source) tuples extracted from papers, model cards, and leaderboards. It provides the raw rows that Evaluation Cards canonicalises and joins. Every Eval Ever is a sister EvalEval project.
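The row shape and the canonicalise-then-join step can be sketched as follows. This is an illustration under stated assumptions, not the project's pipeline: `EvalRow`, `BENCHMARK_ALIASES`, `canonicalise`, and `join_by_key` are hypothetical names, the alias table is invented, and the values are made up.

```python
from collections import defaultdict
from typing import NamedTuple


class EvalRow(NamedTuple):
    """One (model, benchmark, metric-path, value, source) tuple."""
    model: str
    benchmark: str
    metric_path: str
    value: float
    source: str


# Hypothetical alias table mapping spelling variants to one canonical name.
BENCHMARK_ALIASES = {
    "example bench": "example-bench",
    "examplebench": "example-bench",
}


def canonicalise(row: EvalRow) -> EvalRow:
    """Normalise case and whitespace, then apply the alias table."""
    key = " ".join(row.benchmark.lower().split())
    return row._replace(
        model=row.model.strip().lower(),
        benchmark=BENCHMARK_ALIASES.get(key, key),
    )


def join_by_key(rows):
    """Group canonical rows so every source reporting the same score sits together."""
    grouped = defaultdict(list)
    for row in map(canonicalise, rows):
        grouped[(row.model, row.benchmark, row.metric_path)].append((row.value, row.source))
    return dict(grouped)


# Made-up rows: the same result reported under two spellings by two sources.
rows = [
    EvalRow("Model-A", "Example Bench", "test/accuracy", 0.71, "paper"),
    EvalRow("model-a ", "ExampleBench", "test/accuracy", 0.70, "leaderboard"),
]
joined = join_by_key(rows)
```

After canonicalisation both rows share one key, so a reader sees the two reported values side by side, each with its source, instead of as unrelated entries.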
IBM Risk Atlas alignment
Risk-domain annotations on benchmarks (capability, robustness, safety, agentic risk, fairness) so policy readers can identify which deployment-relevant property a number speaks to.
The five-level hierarchy
Family → Composite → Single benchmark → Split → Metric
Every score resolves to an explicit path, so any aggregate claim can be drilled down to the evidence that supports it.
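One way to picture the hierarchy is as a five-field path attached to each score, with drill-down implemented as a prefix match. The sketch below assumes that representation; `ScorePath`, `drill_down`, and all names and values in the example are hypothetical rather than taken from Evaluation Cards.

```python
from typing import NamedTuple


class ScorePath(NamedTuple):
    """Family → Composite → Single benchmark → Split → Metric."""
    family: str
    composite: str
    benchmark: str
    split: str
    metric: str

    def __str__(self) -> str:
        return " → ".join(self)


def drill_down(scores, prefix):
    """Return every (path, value) pair whose path starts with the given prefix."""
    n = len(prefix)
    return [(path, value) for path, value in scores.items() if tuple(path[:n]) == tuple(prefix)]


# Made-up scores, keyed by their full five-level path.
scores = {
    ScorePath("fam", "comp", "bench-1", "test", "accuracy"): 0.71,
    ScorePath("fam", "comp", "bench-2", "test", "f1"): 0.64,
    ScorePath("fam", "other", "bench-3", "dev", "accuracy"): 0.80,
}

# An aggregate claim about the "comp" composite resolves to these two rows.
evidence = drill_down(scores, ("fam", "comp"))
```

Because every score carries its full path, a claim made at the family or composite level can always be expanded into the individual split-and-metric rows behind it.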