About

A benchmark-first interface for reading AI evaluation evidence with separate researcher and policy lenses.

Modes: Research + Policy
Source: Structured JSON
What This Demo Is

Eval Cards is a structured interface for browsing reported AI evaluation results. Instead of treating evaluations as scattered tables, blog posts, and benchmark screenshots, it presents them as a comparable evidence layer around models and benchmarks.

The current demo is benchmark-first. It uses model JSON files as the source of truth, derives model and evaluation views from those files, and highlights reporting provenance, benchmark scope, setup differences, and reproducibility gaps directly in the UI.

The goal is not just to show scores. It is to help readers understand who reported them, what was tested, whether comparisons are fair, and where the evidence is thin.

What You Can Explore
Model snapshots
See benchmark breadth, reporting sources, top signals, and accountability context for a given model.
Evaluation leaderboards
See which models were reported on a benchmark, how they rank, and what methodological or reporting context is attached.
Benchmark detail
Compare setup changes, subtasks, and score spread within a single model’s reported benchmark results.
Research Mode

Research mode foregrounds comparability, benchmark setup, eval libraries, generation config, and score behavior.

Methodology first · Config-aware comparisons · Sample and metric detail
Policy Mode

Policy mode translates the same evidence into plain-language interpretation, evaluator independence, caveats, and comparability warnings.

Plain-language summaries · Accountability cues · Visible limitations
What Counts As Stronger Evidence

Independent reporting, linked sources, benchmark breadth, and complete generation settings all make a score easier to trust and compare.

What To Treat Carefully

Missing generation config, mixed reporting sources, setup-sensitive benchmarks, and large model-size differences can all weaken apples-to-apples comparison.
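These caveats lend themselves to mechanical checks. As a minimal sketch (the field names here are illustrative assumptions, not the demo's actual schema), a reported result could be scanned for the warning signs listed above:

```python
def comparison_caveats(result):
    """Collect reasons a reported score may not support apples-to-apples comparison.

    `result` is a hypothetical dict; the real reporting layer may use other keys.
    """
    caveats = []
    if not result.get("generation_config"):
        caveats.append("missing generation config")
    if len(set(result.get("sources", []))) > 1:
        caveats.append("mixed reporting sources")
    if result.get("setup_sensitive"):
        caveats.append("setup-sensitive benchmark")
    return caveats

# A score with two reporting sources, no generation config, and a
# setup-sensitive benchmark would surface all three warnings.
flags = comparison_caveats(
    {"sources": ["self-reported", "independent"], "setup_sensitive": True}
)
```

A UI can then render these flags next to the score rather than hiding them in a footnote.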

Current Data Model

The app derives its views from JSON files under the top-level data directory. Model-level and evaluation-level summaries are computed from that reporting layer.
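To make "derived views" concrete, here is a hedged sketch of the idea. The record shape and field names below are hypothetical, not the demo's actual JSON schema; the point is only that model-level and evaluation-level summaries both fall out of one reporting layer:

```python
def model_summary(record):
    """Model-level view: benchmark breadth and reporting provenance."""
    sources = {r["source"] for r in record["results"]}
    return {
        "model": record["model"],
        "benchmark_count": len(record["results"]),
        "sources": sorted(sources),
    }

def leaderboard(records, benchmark):
    """Evaluation-level view: every model reported on one benchmark, ranked by score."""
    rows = [
        (rec["model"], r["score"], r["source"])
        for rec in records
        for r in rec["results"]
        if r["benchmark"] == benchmark
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Hypothetical model file; the real data directory may use a different layout.
model_file = {
    "model": "example-model-7b",
    "results": [
        {"benchmark": "GSM8K", "score": 57.1, "source": "self-reported"},
        {"benchmark": "IFEval", "score": 71.4, "source": "independent"},
        {"benchmark": "MMLU-Pro", "score": 38.2, "source": "independent"},
    ],
}
```

Because both views read from the same records, provenance labels like "self-reported" travel with the score into every page that displays it.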

Terminology We Use

The evaluation community often uses benchmark, eval, metric, and task interchangeably. That ambiguity showed up repeatedly in this project, so we use a more operational set of definitions in the interface.

Single benchmark

An individual evaluation with a defined dataset and scoring method.

GSM8K · IFEval · MMLU-Pro
Composite benchmark

A collection of single benchmarks reported together, often under a unified leaderboard.

Open LLM Leaderboard · HELM Instruct · HF Open LLM v2
Metric

Strictly what is measured and how; not a benchmark nested inside a composite.

Accuracy · pass@1 · F1 · binary accuracy
Important example

If Reward Bench lists factuality under a “metrics” heading, we treat that as a benchmark in this interface. The metric is the scoring rule attached to it, such as binary accuracy.

This is also why the Evaluations page can now group single benchmarks underneath a composite benchmark like HF Open LLM v2 instead of conflating the two.
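Under these definitions, that grouping reduces to keying single benchmarks by their parent composite. A minimal sketch, assuming a hypothetical "composite" field that the real schema may express differently:

```python
from collections import defaultdict

# Illustrative records; the "composite" field is an assumption for this sketch.
singles = [
    {"benchmark": "IFEval", "composite": "HF Open LLM v2"},
    {"benchmark": "MMLU-Pro", "composite": "HF Open LLM v2"},
    {"benchmark": "GSM8K", "composite": None},  # standalone single benchmark
]

def group_by_composite(results):
    """Group single benchmarks under their composite; standalones keep their own bucket."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r["composite"] or "standalone"].append(r["benchmark"])
    return dict(grouped)
```

Keeping the composite as a grouping key, rather than a benchmark in its own right, is what prevents the two concepts from being conflated.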

Reading This Interface Responsibly
This interface helps you ask:
What benchmarks were actually reported for this model?
Who reported those results and how independent were they?
Are score differences likely to reflect setup changes rather than capability?
Which benchmarks have broader support versus thin evidence?
This interface should not imply:
That a single score is a complete picture of a model.
That all reported results are directly comparable.
That missing methodology can be ignored if the numbers look strong.
That benchmark coverage is the same thing as deployment readiness.
Explore the evidence layer

Start from models if you want breadth and reporting context. Start from evaluations if you want a benchmark-centric view of model performance and methodology.