The models look diverse.
The data says otherwise.
Researchers at Friedrich-Alexander-Universität Erlangen-Nürnberg and Imperial College London ran a simple test: how much of its training distribution does a state-of-the-art image generator actually learn to reproduce? The answer topped out at 77%. And the metrics everyone uses to evaluate these models couldn't detect the shortfall.
Good scores,
missing data.
The field's primary metric for evaluating image generators, FID, was never designed to measure coverage. It measures distributional distance, which correlates with quality. Distance and coverage are not the same thing.
Since generative adversarial networks arrived in 2014, the standard way to evaluate a generative model has been to ask: do the images it produces look real? The Fréchet Inception Distance (FID), introduced in 2017, operationalizes this by comparing the statistical distribution of generated images against the distribution of real ones in the feature space of an Inception network. Lower FID generally means better quality. The metric became ubiquitous. Papers compete on it. Models are ranked by it.
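For reference, the standard definition of FID can be written out as below. The symbols are our notation, not taken from the paper discussed here: the mean and covariance of Inception features are written with subscript r for real images and g for generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Only the first two moments of each feature distribution enter the formula, which is why two quite different distributions can receive a similar score.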
The problem is that distributional distance is not the same as distributional coverage. A model could learn to generate perfect-looking images from one part of the real distribution while ignoring another part entirely. Its FID score might still be excellent. The gap would be invisible to the evaluator.
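A toy sketch can make this concrete. It is not the paper's experiment and not real FID: it uses a one-dimensional Fréchet distance between Gaussian fits of hand-picked numbers, and every value and function name in it is hypothetical. The "generator" only ever emits values that occur in the real data, yet it skips two of the five modes, and the distance is still zero.

```python
import math
import statistics


def frechet_distance_1d(real, generated):
    """Squared Frechet distance between 1-D Gaussian fits of two samples."""
    mu_r = statistics.fmean(real)
    mu_g = statistics.fmean(generated)
    sd_r = math.sqrt(statistics.pvariance(real))
    sd_g = math.sqrt(statistics.pvariance(generated))
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2


def missing_modes(real, generated):
    """Values present in the real data that the generator never produced."""
    return sorted(set(real) - set(generated))


# Hypothetical "real" data: five equally common modes.
real = [-2, -1, 0, 1, 2] * 4
# Hypothetical generator output: only three modes, reweighted so that
# the mean and variance match the real data exactly.
generated = [-2] * 5 + [0] * 10 + [2] * 5

distance = frechet_distance_1d(real, generated)
missing = missing_modes(real, generated)
share_missing = sum(x in missing for x in real) / len(real)

print("distance:", distance)
print("missing modes:", missing)
print("share of real data never generated:", share_missing)
```

In this construction 40% of the real data has no counterpart in the generated set, and a moment-matching distance has no way to register it.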
This is not a theoretical concern. The research team measured it. They found that no state-of-the-art image generator they evaluated covered more than 77% of its training data's diversity. The remaining fraction, at least 23% of the real distribution, was simply absent from the model's learned outputs. A generator evaluated on FID alone would show no sign of this.
If standard metrics can't detect diversity failures in generative models, how would you know you had one? And if you knew, how would you fix it without sacrificing the image quality the field has spent years optimizing?
The diversity problem matters most where real-world distributions are long-tailed. Medical imaging, for instance, contains many common presentations and a smaller number of rare ones. A generator trained on such data might learn the common cases well and ignore the rare ones. If that generator is used to augment a training dataset for a diagnostic model, the augmented dataset would be biased toward the already-overrepresented cases. The gap in the generator propagates into the downstream model invisibly, because the FID looked fine.
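The arithmetic behind that propagation is easy to sketch. The numbers below are invented for illustration (900 common cases, 100 rare cases, 1,000 synthetic images) and the function names are hypothetical; nothing here comes from the paper.

```python
def rare_fraction(n_common, n_rare):
    """Share of rare cases in a dataset."""
    return n_rare / (n_common + n_rare)


def augment(n_common, n_rare, n_synthetic, generator_rare_rate):
    """Add synthetic samples, a given share of which are rare cases."""
    synthetic_rare = round(n_synthetic * generator_rare_rate)
    synthetic_common = n_synthetic - synthetic_rare
    return n_common + synthetic_common, n_rare + synthetic_rare


# Hypothetical dataset: 10% rare presentations.
before = rare_fraction(900, 100)

# A generator that never produces rare cases halves their share.
common, rare = augment(900, 100, 1000, generator_rare_rate=0.0)
after_skipping = rare_fraction(common, rare)

# A generator that reproduces the true 10% rate leaves the share unchanged.
common, rare = augment(900, 100, 1000, generator_rare_rate=0.1)
after_faithful = rare_fraction(common, rare)

print(before, after_skipping, after_faithful)
```

Under these assumptions, doubling the dataset with a generator that skips the rare cases cuts their share from 10% to 5%, even though every synthetic image might look flawless.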
Two tools.
One for measuring. One for fixing.
The paper's contribution is split into two parts. First, a metric that actually detects coverage gaps: the Image Retrieval Score (IRS). Second, a model variant that addresses the gaps in unconditional diffusion models: DiADM.
The team's first observation was that the standard feature extractors used for evaluation — Inception v3, DINOv2, CLIP — are not well-suited to measuring diversity. They were trained to recognize image content or align visual and text features, not to detect whether a generative model is systematically skipping parts of a distribution. Plugging these extractors into diversity-aware metrics gives unreliable results.
The term "pseudo-conditional" sounds like a contradiction. The point is that the model receives conditioning in the form of feature embeddings from real training images, but no explicit class label or text prompt. It behaves like an unconditional model from the user's perspective (you can sample freely), while having an internal mechanism that makes it possible to target specific regions of the distribution if you choose. The conditioning is derived from the data itself, not from external supervision.
No model passed
the 77% mark.
Three findings across the evaluation. The headline number is 77%. The more important finding is that the metrics everyone uses can't see this problem at all.
Every state-of-the-art image generator the team evaluated fell short of covering the full diversity of its training data. The best coverage any model achieved was 77%, measured by IRS. This means that even the strongest unconditional diffusion models, trained on large curated datasets, systematically skip at least 23% of the distribution they were supposed to learn.
Importantly, these same models often have competitive FID scores. The quality of individual generated images is not the issue. The issue is which images the models choose not to generate.
The team tested whether commonly used feature extractors — Inception v3, DINOv2, CLIP — could detect diversity gaps when plugged into standard diversity metrics. None of them reliably could. This is a calibration problem: the extractors were designed for tasks other than distribution coverage assessment, and their embedding spaces do not faithfully reflect what it means for a generated set to be diverse relative to a training set.
This matters because it means researchers relying on precision, recall, or coverage metrics built on these extractors are receiving unreliable signals about their models' diversity. The IRS sidesteps this by using the retrieval task directly as the diversity proxy, bypassing the extractor calibration issue.
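To give a feel for what a retrieval-style diversity proxy might look like, here is a minimal sketch. It is explicitly not the paper's IRS definition: it assumes a simplified variant in which each generated sample is used as a query, its nearest real sample is retrieved, and the score is the fraction of real samples retrieved at least once. The two-dimensional points stand in for image embeddings, and all names and values are hypothetical.

```python
import math


def retrieval_coverage(real, generated):
    """Fraction of real items retrieved as the nearest neighbour of
    at least one generated query (a simplified, assumed proxy)."""
    retrieved = set()
    for query in generated:
        nearest = min(range(len(real)), key=lambda i: math.dist(query, real[i]))
        retrieved.add(nearest)
    return len(retrieved) / len(real)


# Hypothetical embeddings: two clusters of real items.
real = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
# Hypothetical generated items that all fall near the first cluster.
generated = [(0.1, 0.0), (0.0, 0.9), (0.2, 0.1)]

coverage = retrieval_coverage(real, generated)
print("retrieval coverage:", coverage)
```

A score like this depends on how many queries are issued, since a set of n generated samples can retrieve at most n distinct real items, so any practical version would need to account for sample size.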
Against EDM-2 on every tested benchmark, DiADM improved IRS. In three benchmark settings, it exceeded the diversity of the real reference dataset itself, meaning the pseudo-conditional guidance successfully pushed the model into underrepresented regions of the distribution that even the training data sampling had underexplored. FID scores remained competitive, confirming that the diversity gains were not purchased by generating noisier or less realistic images.
DiADM is demonstrated on unconditional diffusion models. The extension to text-to-image models, conditional models, or other generative architectures is not covered in the paper. The approach depends on having access to training-time feature embeddings, which may not be available in all deployment scenarios. The diversity gains are demonstrated on standard image generation benchmarks; how the technique transfers to highly specialized domains (medical imaging, satellite imagery) is an open question.
What this means
for people who build with generative models.
The core shift is simple: quality and coverage are separate properties of a generative model, and you should measure both. FID tells you about quality. It tells you nothing about which parts of the distribution you are missing.
Where to go
from here.
If you want to measure or address coverage gaps in your own models.