Image Generation · Evaluation · CVPR 2025

The models look diverse.
The data says otherwise.

Researchers at Friedrich-Alexander-Universität Erlangen-Nürnberg and Imperial College London ran a simple test: how much of its training distribution does a state-of-the-art image generator actually learn to reproduce? The answer stopped at 77%. And the metrics everyone uses to evaluate these models couldn't detect the shortfall.

Core finding

FID and similar metrics summarize how closely generated images match real ones in feature space, which tracks image quality, not how much of the real distribution a model covers. A generator can score well and still silently skip nearly a quarter of the data it was trained on.

01 The problem

Good scores,
missing data.

The field's primary metric for evaluating image generators, FID, was never designed to measure coverage. It measures the distance between two feature distributions, which correlates with image quality. Quality and coverage are not the same thing.

Since 2014, the standard way to evaluate a generative model has been to ask: do the images it produces look real? The Fréchet Inception Distance (FID), introduced in 2017, operationalizes this by comparing the feature statistics of generated images against those of real ones. Lower FID generally means better quality. The metric became ubiquitous. Papers compete on it. Models are ranked by it.

The problem is that distributional distance is not the same as distributional coverage. A model could learn to generate perfect-looking images from one part of the real distribution while ignoring another part entirely. Its FID score might still be excellent. The gap would be invisible to the evaluator.
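To make the distinction concrete, here is a toy sketch in one dimension, where the Fréchet distance between two Gaussians fitted to the samples reduces to (mean difference)² + (standard deviation difference)². Everything in it is an illustrative assumption and not the paper's setup: the data, the function names, and the nearest-neighbour count used as a stand-in for coverage. The hypothetical generator emits only every second real point, each twice, so its mean and spread nearly match the real set while half of the real points are never reached.

```python
import statistics

def frechet_distance_1d(real, generated):
    """Frechet distance between 1-D Gaussians fitted to each sample."""
    mu_r, mu_g = statistics.fmean(real), statistics.fmean(generated)
    sd_r, sd_g = statistics.pstdev(real), statistics.pstdev(generated)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2

def retrieved_fraction(real, generated):
    """Fraction of real points that are the nearest real neighbour of some generated point."""
    hits = set()
    for g in generated:
        nearest = min(range(len(real)), key=lambda i: abs(real[i] - g))
        hits.add(nearest)
    return len(hits) / len(real)

# Toy "features": 100 real points evenly spaced on [0, 1).
real = [i / 100 for i in range(100)]
# Hypothetical generator: only every second real point, each emitted twice.
generated = [x for x in real[::2] for _ in range(2)]

fd = frechet_distance_1d(real, generated)
coverage = retrieved_fraction(real, generated)
print(f"Frechet distance: {fd:.6f}")
print(f"retrieved fraction: {coverage:.2f}")
```

The distance comes out close to zero while only half of the real points are retrieved, which is the kind of gap a moment-matching score cannot show.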

This is not a theoretical concern. The research team measured it. They found that no state-of-the-art image generator covered more than 77% of its training data's diversity. The remaining fraction, at least 23% of the real distribution, was simply absent from the model's outputs. A generator evaluated on FID alone would show no sign of this.

The question this paper asks

If standard metrics can't detect diversity failures in generative models, how would you know you had one? And if you knew, how would you fix it without sacrificing the image quality the field has spent years optimizing?

The diversity problem matters most where real-world distributions are long-tailed. Medical imaging, for instance, contains many common presentations and a smaller number of rare ones. A generator trained on such data might learn the common cases well and ignore the rare ones. If that generator is used to augment a training dataset for a diagnostic model, the augmented dataset would be biased toward the already-overrepresented cases. The gap in the generator propagates into the downstream model invisibly, because the FID looked fine.

02 The experiment

Two tools.
One for measuring. One for fixing.

The paper's contribution is split into two parts. First, a metric that actually detects coverage gaps: the Image Retrieval Score. Second, a model variant that addresses the gaps in unconditional diffusion models: DiADM.

The team's first observation was that the standard feature extractors used for evaluation — Inception v3, DINOv2, CLIP — are not well-suited to measuring diversity. They were trained to recognize image content or align visual and text features, not to detect whether a generative model is systematically skipping parts of a distribution. Plugging these extractors into diversity-aware metrics gives unreliable results.

Image Retrieval Score (IRS)

Frames diversity measurement as an information retrieval problem. Synthetic images serve as queries: across the whole generated set, how many distinct real training images get retrieved? A generator that covers more of the training distribution will, on average, retrieve more distinct real images. IRS is interpretable (it reads as a coverage percentage), hyperparameter-free, and requires no class labels. The team validated it under conditions with known ground-truth diversity, where other metrics failed to differentiate.
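The retrieval idea can be sketched in a few lines. This is a simplified proxy written for illustration, not the BeyondFID implementation: the function names, the Euclidean distance, and the toy 2-D "features" are all assumptions, and the published score may normalize or correct the raw count in ways this sketch does not.

```python
import math

def nearest_real_index(query, real_feats):
    """Index of the real feature vector closest to `query` (Euclidean distance)."""
    return min(range(len(real_feats)), key=lambda i: math.dist(query, real_feats[i]))

def retrieval_coverage(real_feats, synth_feats):
    """Return (fraction of real items retrieved, set of retrieved indices)."""
    retrieved = {nearest_real_index(q, real_feats) for q in synth_feats}
    return len(retrieved) / len(real_feats), retrieved

# Toy 2-D "features": three common items near the origin, one rare item far away.
real_feats = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
# Hypothetical generator output that never lands near the rare item.
synth_feats = [(0.01, 0.0), (0.09, 0.01), (0.0, 0.12), (0.02, 0.01)]

coverage, retrieved = retrieval_coverage(real_feats, synth_feats)
missed = sorted(set(range(len(real_feats))) - retrieved)
print(f"coverage: {coverage:.0%}, never retrieved: {missed}")
```

Keeping the set of retrieved indices, not only the fraction, makes it possible to inspect which real items were never reached.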

Diversity-Aware Diffusion Models (DiADM)

Addresses the coverage problem in unconditional diffusion models. Standard unconditional models have no guidance at sampling time: the model picks from the distribution freely, and tends to cluster around modes. DiADM replaces this with pseudo-unconditional conditioning. Inception v3 features are extracted from training images and passed directly into the model as conditioning signals. At inference time, you can specify a target region of the distribution by providing a feature vector, steering the model toward underrepresented areas. No class labels required.

Benchmark evaluation

The team tested across standard image generation benchmarks including CIFAR, FFHQ, and ImageNet. They compared DiADM against EDM-2, a strong unconditional diffusion baseline, measuring both IRS (coverage) and FID (quality) to verify that diversity improvements did not come at the cost of image realism.

What “pseudo-unconditional” means here

The term sounds like a contradiction. The point is that the model receives conditioning in the form of feature embeddings from real training images, but no explicit class label or text prompt. It behaves like an unconditional model from the user's perspective (you can sample freely), while having an internal mechanism that makes it possible to target specific regions of the distribution if you choose. The conditioning is derived from the data itself, not from external supervision.
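The calling pattern described above can be sketched as a thin wrapper. This is a hypothetical interface, not the DiADM code: `feature_bank`, `generate_fn`, and `make_pseudo_unconditional_sampler` are names invented for illustration, and `toy_generate` merely records the conditioning vector where a real model would run its denoising loop.

```python
import random

def make_pseudo_unconditional_sampler(feature_bank, generate_fn, seed=0):
    """Wrap a feature-conditioned generator so it can be called with or without a target.

    feature_bank: feature vectors extracted from training images (hypothetical input).
    generate_fn:  stand-in for a trained, feature-conditioned diffusion sampler.
    """
    rng = random.Random(seed)

    def sample(target_feature=None):
        if target_feature is None:
            # "Unconditional" use: the conditioning vector is drawn from the data itself.
            cond = rng.choice(feature_bank)
        else:
            # Targeted use: steer toward a chosen region of feature space.
            cond = target_feature
        return generate_fn(cond)

    return sample

# Toy stand-ins so the sketch runs end to end.
feature_bank = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]

def toy_generate(cond):
    # A real model would run the denoising loop here; this only records the condition.
    return {"conditioned_on": cond}

sample = make_pseudo_unconditional_sampler(feature_bank, toy_generate)
free_draw = sample()
rare_draw = sample(target_feature=(5.0, 5.0))
print(free_draw, rare_draw)
```

From the caller's side, `sample()` needs no label or prompt, while `sample(target_feature=...)` exposes the steering mechanism.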

03 Findings

No model passed
the 77% mark.

Three findings across the evaluation. The headline number is 77%. The more important finding is that the metrics everyone uses can't see this problem at all.

Max IRS coverage found

77%

Best result among SOTA models tested

DiADM benchmark wins

3/3

Exceeded real reference diversity in 3 benchmark settings

Standard metrics that detect the gap

0

FID, KID, and Inception Score all fail to surface coverage failures

Finding 1: The 77% ceiling

Every state-of-the-art image generator the team evaluated fell short of covering the full diversity of its training data. The best coverage any model achieved was 77%, measured by IRS. This means that even the strongest unconditional diffusion models, trained on large curated datasets, systematically skip at least 23% of the distribution they were supposed to learn.

Importantly, these same models often have competitive FID scores. The quality of individual generated images is not the issue. The issue is which images the models choose not to generate.

Finding 2: Standard feature extractors miss the problem

The team tested whether commonly used feature extractors — Inception v3, DINOv2, CLIP — could detect diversity gaps when plugged into standard diversity metrics. None of them reliably could. This is a calibration problem: the extractors were designed for tasks other than distribution coverage assessment, and their embedding spaces do not faithfully reflect what it means for a generated set to be diverse relative to a training set.

This matters because it means researchers relying on precision, recall, or coverage metrics built on these extractors are receiving unreliable signals about their models' diversity. The IRS sidesteps this by using the retrieval task directly as the diversity proxy, bypassing the extractor calibration issue.

Finding 3: DiADM improves diversity without hurting quality

DiADM improved IRS over EDM-2 on every tested benchmark. In three benchmark settings, it exceeded the diversity of the real reference dataset itself, meaning the pseudo-unconditional conditioning pushed the model into underrepresented regions of the distribution that even sampling from the real data had underexplored. FID scores remained competitive, confirming that the diversity gains did not come from generating noisier or less realistic images.
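One simplified way to see how a generator could retrieve more than a real reference sample is a counting argument. It is an idealized assumption, not the paper's analysis: if each query hit one of N training items uniformly at random, n queries would reach an expected fraction 1 − (1 − 1/N)^n of the items, whereas a sampler steered once toward each item could in principle reach all of them with the same budget. The function name and the numbers below are illustrative.

```python
def expected_retrieved_fraction(n_train, n_queries):
    """Expected fraction of n_train items hit by n_queries uniform draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n_train) ** n_queries

n_train = 1000
random_reference = expected_retrieved_fraction(n_train, n_queries=1000)
# Idealized upper bound: one perfectly targeted query per training item.
ideal_targeted = 1.0

print(f"uniform random queries: {random_reference:.3f}")
print(f"one targeted query per item (ideal): {ideal_targeted:.3f}")
```

Under this toy model, a random sample of the same size as the training set reaches only about 63% of the items, so a raw retrieval fraction is best read against a reference computed with the same number of queries.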

Scope and limitations

DiADM is demonstrated on unconditional diffusion models. The extension to text-to-image models, conditional models, or other generative architectures is not covered in the paper. The approach depends on having access to training-time feature embeddings, which may not be available in all deployment scenarios. The diversity gains are demonstrated on standard image generation benchmarks; how the technique transfers to highly specialized domains (medical imaging, satellite imagery) is an open question.

04 Practical takeaways

What this means
for people who build with generative models.

The core shift is simple: quality and coverage are separate properties of a generative model, and you should measure both. FID tells you about quality. It tells you nothing about which parts of the distribution you are missing.

For teams evaluating generative AI

Add a coverage metric to your evaluation pipeline alongside FID. IRS is available via the open-source BeyondFID package. A model with a competitive FID and a low IRS is telling you something important: it has learned to produce convincing images from a narrow slice of the distribution. That narrowing might be fine for your use case, or it might not. You should know which.

For practitioners in high-stakes domains

Medical imaging, rare-event detection, and fairness applications are all vulnerable to silent coverage failures. A generator trained to augment a medical imaging dataset might produce realistic-looking X-rays while systematically ignoring the rare presentations. The resulting augmented training data would be biased in ways FID would not surface. Measuring coverage before using a generator for data augmentation is a reasonable step to add to the workflow.

For model builders working with unconditional diffusion

DiADM's pseudo-unconditional conditioning offers a concrete route to improving distribution coverage without requiring class labels or text prompts. If your model is showing coverage gaps under IRS, the technique is available in the Trichotomy repository. The computational overhead of adding feature conditioning to training is real but modest relative to the coverage gains on the tested benchmarks.

For researchers and benchmark designers

FID has outlived its role as the single figure of merit for generative models. The field has known this abstractly for several years. This paper provides concrete evidence that current feature extractors used in diversity metrics are miscalibrated for the task, and offers a cleaner alternative. Adding IRS as a standard reported metric alongside FID would give a more honest picture of what models are actually learning.

A note on what this paper does not claim

DiADM is an unconditional model technique. The paper does not address text-to-image diversity, conditional generation, or video. The 77% coverage ceiling is a finding about SOTA unconditional image generators tested on standard benchmarks. Applying the same measurement to very different domains would require validating that IRS remains calibrated for those distributions.

05 Further exploration

Where to go
from here.

Starting points if you want to measure or address coverage gaps in your own models.

Read the paper

Dombrowski, M., Zhang, W., Cechnicka, S., Reynaud, H., & Kainz, B. (2025). Image Generation Diversity Issues and How to Tame Them. Proceedings of CVPR 2025, pages 3029–3039. Friedrich-Alexander-Universität Erlangen-Nürnberg & Imperial College London. arXiv:2411.16171.

Run IRS on your model with BeyondFID

The open-source BeyondFID package implements IRS alongside FID, KID, and other standard metrics. It is designed for unconditional image generation evaluation. Install it and run IRS on your current model before your next benchmark comparison.

Explore DiADM in the Trichotomy repository

The Trichotomy repository (linked from BeyondFID's README) contains the DiADM implementation. If IRS reveals a coverage gap in your unconditional model, this is the starting point for addressing it without rebuilding your architecture.

Audit your data augmentation pipeline

If you use a generative model to produce training data for a downstream task, run IRS on the generator before relying on its outputs. No model in the paper covered more than 77%, so expect the augmented data to be systematically missing parts of the distribution, and the lower the coverage, the larger the gap. Factor that into how you weight the synthetic samples.
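If your real data carries any metadata, one possible audit step is to break coverage down by group so that skipped rare cases become visible. This sketch assumes you already have the set of real-item indices that were retrieved by at least one synthetic query. The group tags (`"common"`, `"rare"`) and the function name are hypothetical, and IRS itself needs no such labels.

```python
from collections import Counter

def coverage_by_group(groups, retrieved):
    """Per-group fraction of real items retrieved by at least one synthetic query.

    groups:    one tag per real item, e.g. a presentation type (hypothetical metadata).
    retrieved: set of real-item indices that some synthetic image retrieved.
    """
    total = Counter(groups)
    hit = Counter(groups[i] for i in retrieved)
    return {g: hit[g] / total[g] for g in total}

# Toy audit: 8 common items, 2 rare ones; the generator never reaches the rare ones.
groups = ["common"] * 8 + ["rare"] * 2
retrieved = {0, 1, 2, 3, 4, 5, 7}

report = coverage_by_group(groups, retrieved)
overall = len(retrieved) / len(groups)
print(f"overall: {overall:.0%}, per group: {report}")
```

An overall figure of 70% hides the fact that the rare group has zero coverage, which is the case that matters most for augmentation.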

Review the diversity measurement landscape

“Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation” (arXiv:2511.10547) is a complementary paper that approaches diversity measurement from a different angle, using attribute-conditional human evaluation. Reading both gives a broader picture of how the community is approaching this problem.
