Marco Jiralerspong, Joey Bose, Ian Gemp, Chongli Qin, Yoram Bachrach, Gauthier Gidel
PyTorch implementation of FLD, a new metric for generative models that is sensitive to overfitting/memorization. It also supports other metrics (FID, KID, Precision, Recall, etc.) in the DINOv2, Inception-v3 and CLIP feature spaces, and lets you compute metrics from within your Python code (e.g. directly from the generative model).
Left: Trichotomic evaluation of generative models. FLD evaluates all 3 axes concurrently. Right: Copycat, a generative model that only outputs copies of the training set, significantly beats out SOTA models when evaluated using Test FID.
FLD is a comprehensive, sample-based metric that is sensitive to sample fidelity, diversity and novelty (i.e. overfitting and whether a model is memorizing the training set). It relies on density estimation in a feature space to compute the perceptual likelihood of generated samples.
- Lower is better
- Roughly between [0, 100] where 0 corresponds to a perfect model
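To make "density estimation in a feature space" more concrete, here is a minimal sketch of the general idea. It is not the library's implementation: it places an isotropic Gaussian with one fixed bandwidth on every generated feature vector and reports the mean log-likelihood of a set of evaluation features under that mixture. The function name `mean_log_likelihood` and the `sigma` parameter are hypothetical, and the actual metric may choose bandwidths and normalize the score differently.

```python
import numpy as np


def mean_log_likelihood(gen_feat, eval_feat, sigma=1.0):
    """Mean log-likelihood of eval_feat under a uniform mixture of
    isotropic Gaussians centered on the rows of gen_feat.

    Illustrative only: a single fixed bandwidth `sigma` is an assumption.
    """
    gen_feat = np.asarray(gen_feat, dtype=np.float64)
    eval_feat = np.asarray(eval_feat, dtype=np.float64)
    n_gen, dim = gen_feat.shape

    # Pairwise squared distances, shape (n_eval, n_gen)
    diffs = eval_feat[:, None, :] - gen_feat[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)

    # Log-density of each evaluation point under each Gaussian component
    log_comp = -0.5 * sq_dists / sigma**2 - 0.5 * dim * np.log(2 * np.pi * sigma**2)

    # Numerically stable log-sum-exp over components, with uniform weights
    max_log = log_comp.max(axis=1, keepdims=True)
    log_mix = max_log[:, 0] + np.log(np.exp(log_comp - max_log).sum(axis=1)) - np.log(n_gen)
    return float(log_mix.mean())


# Synthetic stand-ins for feature vectors (not real model features)
rng = np.random.default_rng(0)
gen = rng.normal(size=(200, 8))
near = rng.normal(size=(100, 8))           # drawn from the same distribution
far = rng.normal(loc=5.0, size=(100, 8))   # drawn from a shifted distribution

ll_near = mean_log_likelihood(gen, near)
ll_far = mean_log_likelihood(gen, far)
```

In this toy setup, evaluation features that resemble the generated ones receive a higher mean log-likelihood than features drawn from a shifted distribution.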
Currently built mainly for images, but all of the metrics can be extended to other modalities (audio, video, tabular, etc.) given appropriate feature extractors.
📣 UPDATE: The Feature Likelihood Score (FLS) has been renamed to Feature Likelihood Divergence (FLD). If your code uses the old version, `fls` can be installed using:
pip install git+https://github.com/marcojira/fld.git@b9db224
pip install git+https://github.com/marcojira/fld.git
from torchvision.datasets.cifar import CIFAR10
from fld.features.DINOv2FeatureExtractor import DINOv2FeatureExtractor
from fld.metrics.FLD import FLD
feature_extractor = DINOv2FeatureExtractor()
train_feat = feature_extractor.get_features(CIFAR10(train=True, root="data", download=True))
test_feat = feature_extractor.get_features(CIFAR10(train=False, root="data", download=True))
# From a directory of generated images
gen_feat = feature_extractor.get_dir_features("/path/to/images", extension="png")
fld_val = FLD().compute_metric(train_feat, test_feat, gen_feat)
print(f"FLD: {fld_val:.3f}")
# For other metrics
from fld.metrics.FID import FID
fid_val = FID().compute_metric(train_feat, None, gen_feat)
print(f"FID: {fid_val:.3f}")
Note: While FLD was originally designed to evaluate sample novelty through the use of a test set, it can also be used without one. To do so, pass a small proportion of the train set to the metric in place of the test features. Doing so will yield a metric that is less sensitive to overfitting (as training samples that are memorized will get a high likelihood).
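Setting aside a small proportion of the train set, as the note above describes, could look like the sketch below. The helper `split_holdout`, the 10% `holdout_frac` default and the fixed seed are illustrative assumptions and not part of the library. The example uses a NumPy array as a stand-in for a feature matrix; it only shows the split itself and makes no claim about what the metric does with it.

```python
import numpy as np


def split_holdout(feat, holdout_frac=0.1, seed=0):
    """Randomly split a feature matrix into (rest, holdout).

    Hypothetical helper: `holdout_frac` and `seed` are illustrative choices.
    """
    n = len(feat)
    n_holdout = max(1, int(round(n * holdout_frac)))
    perm = np.random.default_rng(seed).permutation(n)
    holdout_idx, rest_idx = perm[:n_holdout], perm[n_holdout:]
    return feat[rest_idx], feat[holdout_idx]


# Stand-in for train features: 50 samples with 4 feature dimensions
demo_feat = np.arange(50 * 4, dtype=np.float32).reshape(50, 4)
rest_feat, holdout_feat = split_holdout(demo_feat, holdout_frac=0.1)
```

Here `holdout_feat` is the small proportion of the train set that gets passed to the metric, and `rest_feat` contains the remaining samples.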
IMPORTANT: For a comparable evaluation of FLD on CIFAR10, FFHQ and ImageNet, use the following settings:
- 10k generated samples
- DINOv2 feature space
- CIFAR10:
  - Train: Entire set (50k images)
  - Test: Entire set (10k images)
- FFHQ:
  - Train: First 60k images (of 70k)
  - Test: Last 10k images (i.e. the validation set described in https://github.com/NVlabs/ffhq-dataset)
- ImageNet (using the following ImageNet subset):
  - Train: Entire train set (from the above)
  - Test: Entire test set (from the above)
To only evaluate the degree of overfitting, we recommend looking at the FLD generalization gap (i.e. the difference between train and test FLD). The more this value is negative, the more your model is overfitting. This can be done as follows:
from fld.metrics.FLD import FLD
gen_gap = FLD(eval_feat="gap").compute_metric(train_feat, test_feat, gen_feat)
print(f"Generalization Gap FLD: {gen_gap:.3f}")
The evaluation pipeline goes Data => Features => Metrics.
To get the train, test or gen features, data can be provided to the feature extractors in the following formats:
import torch
from fld.features.DINOv2FeatureExtractor import DINOv2FeatureExtractor
feature_extractor = DINOv2FeatureExtractor()
# torch.utils.Dataset (e.g. torchvision.datasets but could also be your own custom class)
from torchvision.datasets.cifar import CIFAR10
feat = feature_extractor.get_features(CIFAR10(train=True, root="data", download=True))
# Directory of samples (will create a dataset from all images in that directory that match `extension`, found recursively)
feat = feature_extractor.get_dir_features("/path/to/images", extension="jpg")
# Image tensor of float32 of size N x C x H x W in range [0, 1]
img_tensor = torch.rand((10_000, 3, 32, 32))
feat = feature_extractor.get_tensor_features(img_tensor)
# Generate function for model that returns tensor of float32 of size B x C x H x W in range [0, 1]
def gen_fn(x):
    x = torch.randn(128, 100)
    return model(x)
feat = feature_extractor.get_model_features(gen_fn, num_samples=10_000)
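Many image pipelines produce uint8 arrays in N x H x W x C layout with values in [0, 255], whereas the tensor inputs above are described as float32 in N x C x H x W layout with values in [0, 1]. The sketch below shows one way to do that conversion with NumPy. The helper name `to_nchw_float` is hypothetical, and the result would still need to be wrapped with `torch.from_numpy` before being handed to a feature extractor.

```python
import numpy as np


def to_nchw_float(images_uint8):
    """Convert uint8 images (N x H x W x C, values 0..255) to
    float32 (N x C x H x W, values 0..1)."""
    images = np.asarray(images_uint8)
    if images.ndim != 4 or images.dtype != np.uint8:
        raise ValueError("expected a uint8 array of shape N x H x W x C")
    scaled = images.astype(np.float32) / 255.0
    # Move the channel axis from last to second position
    return np.ascontiguousarray(scaled.transpose(0, 3, 1, 2))


# Random stand-in for a batch of four 32x32 RGB images
batch = np.random.default_rng(0).integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
converted = to_nchw_float(batch)
```

Raising on unexpected input is a design choice here: silently rescaling an array that is already in [0, 1] would produce near-black images and misleading features.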
The `FeatureExtractor` class is designed to map images to the given feature space where metrics are computed.
Currently supports DINOv2, CLIP and Inception-v3 (DINOv2 is recommended).
from fld.features.DINOv2FeatureExtractor import DINOv2FeatureExtractor
from fld.features.CLIPFeatureExtractor import CLIPFeatureExtractor
from fld.features.InceptionFeatureExtractor import InceptionFeatureExtractor
feature_extractor = CLIPFeatureExtractor() # or InceptionFeatureExtractor()
Feature extraction can be relatively computationally expensive. Caching features avoids recomputing them unnecessarily. For example, if you want to evaluate model performance over the course of training, the train/test set features only need to be computed once at the start of training and can then be retrieved from the cache for subsequent evaluations. To do so:
# First specify where the features should be cached when creating the feature extractor
feature_extractor = DINOv2FeatureExtractor(save_path="/path/to/save/features")
# Then, pass `name` when getting features you want to cache
train_feat = feature_extractor.get_features(CIFAR10(train=True, root="data", download=True), name="CIFAR10_train") # Will cache
test_feat = feature_extractor.get_features(CIFAR10(train=False, root="data", download=True)) # Won't cache
gen_feat = feature_extractor.get_dir_features("/path/to/images", extension="png", name="CIFAR10_gen_epoch_0") # Will cache
# Finally, if you get features with the same `name` after that at any point, will retrieve from cache
train_feat = feature_extractor.get_features(CIFAR10(train=True, root="data", download=True), name="CIFAR10_train")
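The general pattern behind name-based caching can be sketched as follows. This illustrates the idea only and is not the library's actual implementation: the helper `cached_features`, the `.npy` file layout and the `compute_fn` callback are all assumptions introduced for this example.

```python
import os
import tempfile

import numpy as np


def cached_features(save_path, name, compute_fn):
    """Return the features stored under `name`, computing and saving
    them on a cache miss. Hypothetical helper for illustration."""
    os.makedirs(save_path, exist_ok=True)
    cache_file = os.path.join(save_path, f"{name}.npy")
    if os.path.exists(cache_file):
        # Cache hit: skip the expensive extraction entirely
        return np.load(cache_file)
    feat = np.asarray(compute_fn())
    np.save(cache_file, feat)
    return feat


calls = []


def expensive_extraction():
    # Stand-in for a real feature extraction pass; records each invocation
    calls.append(1)
    return np.ones((3, 2), dtype=np.float32)


with tempfile.TemporaryDirectory() as tmp_dir:
    first = cached_features(tmp_dir, "demo_train", expensive_extraction)
    second = cached_features(tmp_dir, "demo_train", expensive_extraction)
```

The second call with the same name reads from disk, so `expensive_extraction` runs only once. A cache keyed purely by name will return stale features if the underlying data changes, so names should identify the data uniquely (e.g. include the epoch for generated samples, as in the example above).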
The `FeatureExtractor` class can be extended to use your own feature extractors (see the example below; you need to change everything marked with `SPECIFY`).
import torch
import torchvision.transforms as transforms
from fld.features.ImageFeatureExtractor import ImageFeatureExtractor
class CustomFeatureExtractor(ImageFeatureExtractor):
    def __init__(self, save_path=None):
        self.name = "Custom"  # SPECIFY
        super().__init__(save_path)
        self.features_size = 768  # SPECIFY
        # SPECIFY (function applied to inputs)
        self.preprocess = lambda x: x

    # SPECIFY (function that takes batch of preprocessed inputs and returns tensor of features)
    def get_feature_batch(self, img_batch):
        pass
# Can then use the same caching functionality of other extractors
feature_extractor = CustomFeatureExtractor(save_path="path/to/features")
gen_feat = feature_extractor.get_dir_features("/path/to/images", extension="png", name="CIFAR10_gen_epoch_0")
Currently supported:
- FLD
- AuthPct (the % of authentic samples as defined by Authenticity)
- CTTest (the $C_T$ test statistic)
- FID
- KID
- Precision
- Recall
To compute other metrics:
# All metrics have the function `.compute_metric(train_feat, test_feat, gen_feat)`
""" AuthPct """
from fld.metrics.AuthPct import AuthPct
AuthPct().compute_metric(train_feat, test_feat, gen_feat)
""" CTTest """
from fld.metrics.CTTest import CTTest
CTTest().compute_metric(train_feat, test_feat, gen_feat)
""" FID """
from fld.metrics.FID import FID
# Default FID (50k samples compared to train set)
FID().compute_metric(train_feat, None, gen_feat)
# Test FID
FID(ref_feat="test").compute_metric(None, test_feat, gen_feat)
""" FLD """
from fld.metrics.FLD import FLD
# To get Train FLD instead of Test FLD
FLD(eval_feat="train").compute_metric(train_feat, test_feat, gen_feat)
""" KID """
from fld.metrics.KID import KID
# Like FID, can get either Train or Test KID
KID(ref_feat="test").compute_metric(None, test_feat, gen_feat)
""" Precision/Recall """
from fld.metrics.PrecisionRecall import PrecisionRecall
PrecisionRecall(mode="Precision").compute_metric(train_feat, None, gen_feat) # Default precision
PrecisionRecall(mode="Recall", num_neighbors=5).compute_metric(train_feat, None, gen_feat) # Recall with k=5
For each generated sample, a memorization score can be computed:
from fld.sample_evaluation import sample_memorization_scores
memorization_scores = sample_memorization_scores(train_feat, test_feat, gen_feat)
**Note: Passing too many generated samples at once may cause memory issues.**
Instead of estimating the density of the generated samples, we can estimate the density of the test set and use it to get the likelihood of the generated samples:
from fld.sample_evaluation import sample_quality_scores
quality_scores = sample_quality_scores(train_feat, test_feat, gen_feat)
Note: This is somewhat dependent on the feature space (e.g. some image classes are naturally closer in some feature spaces -> higher likelihood)
If you find this repository useful, please consider citing it:
@misc{jiralerspong2023feature,
title={Feature Likelihood Score: Evaluating Generalization of Generative Models Using Samples},
author={Marco Jiralerspong and Avishek Joey Bose and Ian Gemp and Chongli Qin and Yoram Bachrach and Gauthier Gidel},
year={2023},
eprint={2302.04440},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
The authors acknowledge the material support of NVIDIA in the form of computational resources. Joey Bose was generously supported by an IVADO Ph.D. fellowship.