Production-ready data processing made easy and shareable
Explore the Fondant docs »
This repository contains the code to build a CLIP index for the Datacomp-12.8M dataset with Fondant. It should be straightforward to apply it to a different dataset.
The resulting embedded dataset and index have been published on the Hugging Face Hub here. The data repository is structured as follows:
- `data/`: The dataset containing ids, urls, and CLIP embeddings
- `faiss`: The FAISS index
- `id_mapping/`: The mapping of the FAISS ids to the original urls
Continue reading below to learn:
- Why we need a CLIP index
- How to use the CLIP index
- Which steps are needed to create the index
- The execution details of our run
- What's next
Large (image) datasets are often unwieldy to use due to their sheer size. Assume, for instance, that we would like to extract all the cat images from such a dataset. We would have to look at every image to decide whether it is a cat image or not. And if we next want to extract all the dog images, we again need to look at every image.
Instead, we can look at every image once and calculate a (CLIP) embedding representing its content. By combining these embeddings into an index, we can efficiently search the dataset with a query and find specific images without having to look at each one again.
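To make the idea concrete, here is a toy sketch of the "embed once, query many times" principle. Random unit vectors stand in for real CLIP embeddings and plain numpy stands in for a FAISS index; everything here (the array sizes, the `search` helper, image id 42) is illustrative, not part of the actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are CLIP embeddings for 1,000 images (dim 512), L2-normalised.
# In reality they would come from a CLIP model and live in a FAISS index.
embeddings = rng.normal(size=(1000, 512)).astype("float32")
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the ids of the k most similar images by cosine similarity."""
    query = query / np.linalg.norm(query)
    scores = embeddings @ query          # cosine similarity for unit vectors
    return np.argsort(-scores)[:k]       # highest similarity first

# A query simulated as a noisy copy of image 42's embedding:
query = embeddings[42] + 0.01 * rng.normal(size=512).astype("float32")
top_ids = search(query)  # image 42 should rank first
```

The embeddings are computed once; each new query (cats, dogs, ...) only costs a cheap similarity search instead of another pass over all the images.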
This is what LAION did for their LAION-5b dataset, which made it usable, as we did in our ControlNet example. Unfortunately, the LAION-5b dataset and index have been taken offline (temporarily) and there aren't any alternatives. This is why we built an index for the Datacomp-12.8M dataset. While it is a lot smaller than LAION-5b, it should already enable a lot of use cases again, and can hopefully be the start towards building indices for more and larger datasets.
We leveraged Fondant to generate the CLIP index and published the pipeline in this git repository. You can find it in `pipeline.py`.
The pipeline consists of 4 steps:

- A `load_from_hf_hub` operation that loads the datacomp_small dataset from the Hugging Face Hub into the Fondant workspace and format.
- A `download_images` operation which downloads the actual images from the urls in the dataset.
- An `embed_images` operation which embeds the downloaded images using a CLIP model.
- A `write_to_file` operation which writes the original urls and generated embeddings to the chosen destination.
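The data flow through these four steps can be sketched in plain Python. This is not the actual Fondant pipeline (see `pipeline.py` for that); the download and embedding steps are stubbed out here, and all function bodies are illustrative stand-ins:

```python
# Illustrative sketch of the four pipeline steps; every body is a stub.
import hashlib

def load_from_hf_hub() -> list[dict]:
    # Stub: the real pipeline reads the datacomp_small dataset from the Hub.
    return [{"id": i, "url": f"https://example.com/img{i}.jpg"} for i in range(3)]

def download_images(rows: list[dict]) -> list[dict]:
    # Stub: pretend each url yields image bytes.
    for row in rows:
        row["image"] = row["url"].encode()  # placeholder for downloaded bytes
    return rows

def embed_images(rows: list[dict]) -> list[dict]:
    # Stub: the real pipeline runs a CLIP model; we hash bytes instead.
    for row in rows:
        digest = hashlib.sha256(row["image"]).digest()
        row["embedding"] = [b / 255 for b in digest]  # fake 32-dim "embedding"
    return rows

def write_to_file(rows: list[dict]) -> list[dict]:
    # Keep only what the published dataset contains: ids, urls, embeddings.
    return [{"id": r["id"], "url": r["url"], "embedding": r["embedding"]}
            for r in rows]

result = write_to_file(embed_images(download_images(load_from_hf_hub())))
```

Each operation consumes and extends the rows produced by the previous one, which mirrors how Fondant chains reusable operations over a shared dataset.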
You can run it by installing Fondant:

```shell
pip install fondant==0.11.0
```

and running it with your runner of choice:

```shell
fondant run <runner> pipeline.py
```
Check the Fondant documentation for more info.
After running the pipeline, we used autofaiss to build the CLIP index. You can use the included wrapper script `build_index.py`.

Once you have created the index, you can explore it and validate that everything is working using the `exploration.ipynb` notebook.
The easiest way to use the index is with Fondant, which offers reusable operations that allow you to query the index with your own data.
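Conceptually, a query boils down to three steps: embed the query, ask the index for the nearest ids, and translate those ids back to urls via the published `id_mapping`. A minimal numpy sketch of that last translation step, with made-up embeddings and urls standing in for the real FAISS index and CLIP model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for the published artifacts: unit-norm "CLIP" embeddings and
# an id_mapping from index ids back to the original urls.
embeddings = rng.normal(size=(100, 64)).astype("float32")
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
id_mapping = {i: f"https://example.com/image_{i}.jpg" for i in range(100)}

def query_index(query_embedding: np.ndarray, k: int = 3) -> list[str]:
    """Nearest ids by cosine similarity, mapped back to their urls."""
    q = query_embedding / np.linalg.norm(query_embedding)
    ids = np.argsort(-(embeddings @ q))[:k]
    return [id_mapping[int(i)] for i in ids]

urls = query_index(embeddings[7])  # querying with image 7's own embedding
```

The reusable Fondant operations package exactly this kind of lookup so you don't have to wire the index and the id mapping together yourself.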
To see how it can be used in an end-to-end example, check our ControlNet example which uses the index to create a dataset to fine-tune a ControlNet model on a specific domain.
There are other open-source tools which let you leverage a CLIP index. We can recommend clip-retrieval, which can host the index as a service accessible via an API.
For the execution details of our 12.8M run, check the announcement.
With Fondant, we aim to make dataset building collaborative, and we will share more features built on top of the Datacomp datasets in the future to showcase this. To stay up to date, join our Discord.
Based on the popularity and feedback we receive on this 12.8M index, we might generate a CLIP index for the datacomp-128M dataset. If there are other datasets you are interested in, or want to generate an index for a different dataset yourself, please let us know in our Discord.