📖 Paper | 🤗 Demo | 🤖 ModelScope | Checkpoints | Datasets
ONE-PEACE is a general representation model across vision, audio, and language modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results in vision, audio, audio-language, and vision-language tasks. Furthermore, ONE-PEACE possesses a strong emergent zero-shot retrieval capability, enabling it to align modalities that are not paired in the training data.
The architecture and pretraining tasks of ONE-PEACE are shown below. Thanks to its scaling-friendly architecture and modality-agnostic tasks, ONE-PEACE has the potential to expand to unlimited modalities.
We provide an online demo on Hugging Face Spaces. In this demo, you can combine multiple modalities to retrieve related images, such as audio-to-image, audio+text-to-image, audio+image-to-image, and even audio+image+text-to-image.
- 2023.7.20: Released the visual grounding API, which you can use to locate objects in an image.
- 2023.6.23: Released the fine-tuning scripts and checkpoints for vision tasks. See the guidance for vision tasks for more details.
- 2023.6.04: Released the pretraining scripts. See guidance for pretraining for more details.
- 2023.5.30: Released the finetuned checkpoints and scripts for audio(-language) tasks.
- 2023.5.29: Released the finetuned checkpoints for vision-language tasks.
- 2023.5.27: 🔥 We have provided the multimodal retrieval demo on Hugging Face Spaces. Have fun!
- 2023.5.25: Released the multimodal embedding API, which enables quick extraction of image, audio, and text representations.
- 2023.5.23: Released the pretrained checkpoint, as well as finetuning & inference scripts for vision-language tasks.
- 2023.5.19: Released the paper and code. Pretrained & finetuned checkpoints, training & inference scripts, as well as demos will be released as soon as possible.
We list the parameters and pretrained checkpoints of ONE-PEACE below. Note that ONE-PEACE can be disassembled into different branches to handle different tasks. We also provide the vision branch of ONE-PEACE, which can be used to perform vision tasks.
Model | Ckpt | Params | Hidden size | Intermediate size | Attention heads | Layers |
---|---|---|---|---|---|---|
ONE-PEACE | Download | 4B | 1536 | 6144 | 24 | 40 |
ONE-PEACE (Vision Branch) | Download | 1.5B | 1536 | 6144 | 24 | 40 |
Task | Image classification | Semantic Segmentation | Object Detection (w/o Object365) | Video Action Recognition |
---|---|---|---|---|
Dataset | Imagenet-1K | ADE20K | COCO | Kinetics 400 |
Split | val | val | val | val |
Metric | Acc. | mIoU<sup>ss</sup> / mIoU<sup>ms</sup> | AP<sup>box</sup> / AP<sup>mask</sup> | Top-1 Acc. / Top-5 Acc. |
ONE-PEACE | 89.8 | 62.0 / 63.0 | 60.4 / 52.9 | 88.1 / 97.8 |
Task | Audio-Text Retrieval | | | | Audio Classification | | | Audio Question Answering |
---|---|---|---|---|---|---|---|---|
Dataset | AudioCaps | | Clotho | | ESC-50 | FSD50K | VGGSound (Audio-Visual) | AVQA |
Split | test | | evaluation | | full | eval | test | val |
Metric | T2A R@1 | A2T R@1 | T2A R@1 | A2T R@1 | Zero-shot Acc. | mAP | Acc. | Acc. |
ONE-PEACE | 42.5 | 51.0 | 22.4 | 27.1 | 91.8 | 69.7 | 68.2 | 92.2 |
Task | Image-Text Retrieval (w/o ranking) | | | | Visual Grounding | | | VQA | Visual Reasoning |
---|---|---|---|---|---|---|---|---|---|
Dataset | COCO | | Flickr30K | | RefCOCO | RefCOCO+ | RefCOCOg | VQAv2 | NLVR2 |
Split | test | | test | | val / testA / testB | val / testA / testB | val-u / test-u | test-dev / test-std | dev / test-P |
Metric | I2T R@1 | T2I R@1 | I2T R@1 | T2I R@1 | Acc@0.5 | Acc@0.5 | Acc@0.5 | Acc. | Acc. |
ONE-PEACE | 84.1 | 65.4 | 97.6 | 89.6 | 92.58 / 94.18 / 89.26 | 88.77 / 92.21 / 83.23 | 89.22 / 89.27 | 82.6 / 82.5 | 87.8 / 88.3 |
- 3.6 <= Python <= 3.10
- PyTorch >= 1.10.0 (1.13.1 recommended)
- CUDA >= 10.2 (11.6 recommended)
- Install required packages:
git clone https://github.com/OFA-Sys/ONE-PEACE
cd ONE-PEACE
pip install -r requirements.txt
- For faster training, install the Apex library (optional):
git clone https://github.com/NVIDIA/apex
cd apex && pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./
- Install the xFormers library to use memory-efficient attention (optional):
conda install xformers -c xformers
- Install the FlashAttention library to use its faster LayerNorm kernel (optional):
git clone --recursive https://github.com/HazyResearch/flash-attention
cd flash-attention && pip install .
cd csrc/layer_norm && pip install .
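After installation, a quick environment check (a minimal optional snippet of our own, not part of the repository) can confirm the PyTorch/CUDA setup and show which of the optional acceleration libraries are importable:

import importlib.util

import torch

# report the core requirements
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# report which optional acceleration libraries are installed
for pkg in ("apex", "xformers", "flash_attn"):
    found = importlib.util.find_spec(pkg) is not None
    print("{}: {}".format(pkg, "installed" if found else "not installed (optional)"))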
See datasets.md and checkpoints.md.
We provide a simple code snippet to show how to use the ONE-PEACE API. Here we use ONE-PEACE to compute embeddings for text, images, and audio, as well as their similarities:
import torch
from one_peace.models import from_pretrained

device = "cuda" if torch.cuda.is_available() else "cpu"
# "ONE-PEACE" can also be replaced with ckpt path
model = from_pretrained("ONE-PEACE", device=device, dtype="float32")

# process raw data
src_tokens = model.process_text(["cow", "dog", "elephant"])
src_images = model.process_image(["assets/dog.JPEG", "assets/elephant.JPEG"])
src_audios, audio_padding_masks = model.process_audio(["assets/cow.flac", "assets/dog.flac"])

with torch.no_grad():
    # extract normalized features
    text_features = model.extract_text_features(src_tokens)
    image_features = model.extract_image_features(src_images)
    audio_features = model.extract_audio_features(src_audios, audio_padding_masks)

    # compute similarity
    i2t_similarity = image_features @ text_features.T
    a2t_similarity = audio_features @ text_features.T

print("Image-to-text similarities:", i2t_similarity)
print("Audio-to-text similarities:", a2t_similarity)
We use ONE-PEACE to perform visual grounding on anime pictures:
import torch
import cv2
from one_peace.models import from_pretrained

device = "cuda" if torch.cuda.is_available() else "cpu"
model = from_pretrained(
    "ONE-PEACE_Grounding",
    model_type="one_peace_classify",
    device=device,
    dtype="float32"
)

# process raw data
image_text_list = [
    ("assets/pokemons.jpg", "a blue turtle-like pokemon with round head"),
    ("assets/pokemons.jpg", "Bulbasaur"),
    ("assets/pokemons.jpg", "Charmander"),
    ("assets/pokemons.jpg", "Squirtle"),
    ("assets/one_piece.jpeg", "Brook"),
    ("assets/one_piece.jpeg", "Franky"),
    ("assets/one_piece.jpeg", "Monkey D. Luffy"),
    ("assets/one_piece.jpeg", "Nami"),
    ("assets/one_piece.jpeg", "Nico Robin"),
    ("assets/one_piece.jpeg", "Roronoa Zoro"),
    ("assets/one_piece.jpeg", "Tony Tony Chopper"),
    ("assets/one_piece.jpeg", "Usopp"),
    ("assets/one_piece.jpeg", "Vinsmoke Sanji"),
]
(src_images, image_widths, image_heights), src_tokens = model.process_image_text_pairs(
    image_text_list, return_image_sizes=True
)

with torch.no_grad():
    # extract features
    vl_features = model.extract_vl_features(src_images, src_tokens).sigmoid()
    # extract coords
    vl_features[:, ::2] *= image_widths.unsqueeze(1)
    vl_features[:, 1::2] *= image_heights.unsqueeze(1)
    coords = vl_features.cpu().tolist()

# display results
for i, image_text_pair in enumerate(image_text_list):
    image, text = image_text_pair
    img = cv2.imread(image)
    cv2.rectangle(
        img,
        (int(coords[i][0]), int(coords[i][1])),
        (int(coords[i][2]), int(coords[i][3])),
        (0, 255, 0),
        3
    )
    cv2.imshow(text, img)
    cv2.waitKey(3500)
    cv2.destroyAllWindows()
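Note that cv2.imshow requires a GUI backend. On a headless machine (e.g. a remote server), a simple alternative, offered here as a suggestion rather than part of the official example, is to save the annotated images with cv2.imwrite instead of displaying them:

# headless alternative to the display loop above: draw the predicted box and save to disk
for i, (image, text) in enumerate(image_text_list):
    img = cv2.imread(image)
    x0, y0, x1, y1 = (int(v) for v in coords[i])
    cv2.rectangle(img, (x0, y0), (x1, y1), (0, 255, 0), 3)
    out_path = "grounding_result_{}.jpg".format(i)  # hypothetical output filename
    cv2.imwrite(out_path, img)
    print("saved {} for query: {}".format(out_path, text))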
We use ONE-PEACE to perform audio classification:
import torch
import json
from one_peace.models import from_pretrained

id2label = json.load(open("assets/vggsound_id2label.json"))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = from_pretrained(
    "ONE-PEACE_VGGSound",
    model_type="one_peace_classify",
    device=device,
    dtype="float32"
)

# process audio
audio_list = ["assets/cow.flac", "assets/dog.flac"]
src_audios, audio_padding_masks = model.process_audio(audio_list)

with torch.no_grad():
    # extract audio features
    audio_logits = model.extract_audio_features(src_audios, audio_padding_masks)
    print(audio_logits.size())

    predict_label_ids = audio_logits.argmax(1).cpu().tolist()

for audio, predict_label_id in zip(audio_list, predict_label_ids):
    predict_label = id2label[str(predict_label_id)]
    print('audio: {}, predict label: {}'.format(audio, predict_label))
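To inspect more than the top-1 prediction, a small extension (a sketch building on the logits computed above, not part of the official example) reports the top-5 labels with softmax probabilities:

# sketch: top-5 predictions with probabilities, reusing audio_logits and id2label from above
probs = audio_logits.softmax(dim=-1)
top_probs, top_ids = probs.topk(5, dim=-1)
for audio, ids, ps in zip(audio_list, top_ids.cpu().tolist(), top_probs.cpu().tolist()):
    top5 = ", ".join("{} ({:.2%})".format(id2label[str(label_id)], p) for label_id, p in zip(ids, ps))
    print("audio: {}, top-5: {}".format(audio, top5))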
If the API alone does not meet your needs, we also offer comprehensive training and inference instructions for audio & multimodal and vision tasks.
- Fairseq A sequence modeling toolkit with flexible configuration and highly extensible code structure.
- xFormers A toolbox to accelerate research on Transformers.
- FlashAttention A repository that provides the official implementation of FlashAttention, which greatly speeds up multi-head attention.
- Apex A repository that provides useful model acceleration and memory optimization techniques.
Feel free to submit GitHub issues or pull requests; contributions to our project are welcome!
To contact us, don't hesitate to send an email to zheluo.wp@alibaba-inc.com or saimeng.wsj@alibaba-inc.com.
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)
@article{wang2023one,
title={ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities},
author={Wang, Peng and Wang, Shijie and Lin, Junyang and Bai, Shuai and Zhou, Xiaohuan and Zhou, Jingren and Wang, Xinggang and Zhou, Chang},
journal={arXiv preprint arXiv:2305.11172},
year={2023}
}