Eval Cards is a structured interface for browsing reported AI evaluation results. Instead of treating evaluations as scattered tables, blog posts, and benchmark screenshots, it presents them as a comparable evidence layer around models and benchmarks.
The current demo is benchmark-first. It uses model JSON files as the source of truth, derives model and evaluation views from those files, and highlights reporting provenance, benchmark scope, setup differences, and reproducibility gaps directly in the UI.
The goal is not just to show scores. It is to help readers understand who reported them, what was tested, whether comparisons are fair, and where the evidence is thin.
Research mode foregrounds comparability, benchmark setup, eval libraries, generation config, and score behavior.
Policy mode translates the same evidence into plain-language interpretation, evaluator independence, caveats, and comparability warnings.
Independent reporting, linked sources, benchmark breadth, and complete generation settings all make a score easier to trust and compare.
Missing generation config, mixed reporting sources, setup-sensitive benchmarks, and large model-size differences can all weaken apples-to-apples comparison.
The app derives its views from JSON files under the top-level data directory. Model-level and evaluation-level summaries are computed from that reporting layer.
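As a rough sketch of that derivation step, the following loads model JSON files from a data directory and pivots them into a benchmark-first view. The file layout (one JSON file per model with a top-level "evaluations" list) is an assumed schema for illustration, not the app's exact format.

```python
import json
from collections import defaultdict
from pathlib import Path

def load_reports(data_dir: str = "data") -> list[dict]:
    """Load every model JSON file under the data directory.

    Assumes one file per model; the schema is illustrative.
    """
    return [json.loads(p.read_text()) for p in sorted(Path(data_dir).glob("*.json"))]

def benchmark_view(reports: list[dict]) -> dict[str, list[dict]]:
    """Pivot model-first files into a benchmark-first view:
    benchmark name -> list of {model, score} rows."""
    by_benchmark: dict[str, list[dict]] = defaultdict(list)
    for model in reports:
        for ev in model.get("evaluations", []):
            by_benchmark[ev["benchmark"]].append(
                {"model": model["name"], "score": ev["score"]}
            )
    return dict(by_benchmark)
```

Keeping the JSON files as the single source of truth means both the model pages and the evaluation pages are recomputed from the same reporting layer rather than maintained separately.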
The evaluation community often uses benchmark, eval, metric, and task interchangeably. That ambiguity showed up repeatedly in this project, so we use a more operational set of definitions in the interface.
Benchmark: an individual evaluation with a defined dataset and scoring method.
Composite benchmark: a collection of single benchmarks reported together, often under a unified leaderboard.
Metric: strictly what is measured and how; not a benchmark nested inside a composite.
If Reward Bench lists factuality under a “metrics” heading, we treat that as a benchmark in this interface. The metric is the scoring rule attached to it, such as binary accuracy.
This is also why the Evaluations page can now group single benchmarks underneath a composite benchmark like HF Open LLM v2 instead of conflating the two.
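That grouping can be sketched as a lookup from single benchmarks to their parent composite. The membership table below is an assumption for illustration (the listed HF Open LLM v2 members may not match the app's data), and the function name is hypothetical.

```python
# Illustrative composite membership; not the app's real data.
COMPOSITES = {
    "HF Open LLM v2": ["MMLU-Pro", "GPQA", "MATH", "IFEval", "BBH", "MuSR"],
}

# Invert the table once: single benchmark -> parent composite.
MEMBER_TO_PARENT = {
    member: parent
    for parent, members in COMPOSITES.items()
    for member in members
}

def group_by_composite(benchmarks: list[str]) -> tuple[dict[str, list[str]], list[str]]:
    """Split benchmark names into (composite -> members seen) and standalones."""
    grouped: dict[str, list[str]] = {}
    standalone: list[str] = []
    for name in benchmarks:
        parent = MEMBER_TO_PARENT.get(name)
        if parent:
            grouped.setdefault(parent, []).append(name)
        else:
            standalone.append(name)
    return grouped, standalone
```

Grouping at display time, rather than in the data files, keeps each single benchmark as its own record while still letting the UI present the composite as one unit.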
Start from models if you want breadth and reporting context. Start from evaluations if you want a benchmark-centric view of model performance and methodology.
