This repository contains a list of papers, code, datasets, and leaderboards on the topic of Knowledge-Driven Vision-Language Pretraining. If you find any errors, please don't hesitate to open an issue or pull request.
We will continue to add and update related papers and code on this page (last updated June 8th, 2022).
(For new learners, some important papers for general vision-language pretraining.)
- [CLIP] Learning Transferable Visual Models From Natural Language Supervision, in PMLR 2021. [pdf] [code]
- [Transformer] Attention Is All You Need, in NeurIPS 2017. [pdf]
- Video-and-Language Pre-training with Entity Prompts, in CVPR 2022. [pdf]
- KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation, in NAACL 2022 Findings. [pdf]
- VinVL: Revisiting Visual Representations in Vision-Language Models, in CVPR 2021. [pdf]
- E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning, in ACL 2021. [pdf]
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, in ECCV 2020. [pdf]
- UNITER: UNiversal Image-TExt Representation Learning, in ECCV 2020. [pdf]
- VL-BERT: Pre-training of Generic Visual-Linguistic Representations, in ICLR 2020. [pdf]
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, in AAAI 2020. [pdf]
- Unified Vision-Language Pre-Training for Image Captioning and VQA, in AAAI 2020. [pdf]
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, in NeurIPS 2019. [pdf]
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers, in EMNLP 2019. [pdf]
- Fusion of Detected Objects in Text for Visual Question Answering, in EMNLP 2019. [pdf]
- What Value Do Explicit High Level Concepts Have in Vision to Language Problems?, in CVPR 2016. [pdf]
- Image Captioning with Semantic Attention, in CVPR 2016. [pdf]
- ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graphs, in AAAI 2021. [pdf]
- MUREL: Multimodal Relational Reasoning for Visual Question Answering, in CVPR 2019. [pdf]
- Relation-Aware Graph Attention Network for Visual Question Answering, in ICCV 2019. [pdf]
- Learning Conditioned Graph Structures for Interpretable Visual Question Answering, in NeurIPS 2018. [pdf]
- Verbs in Action: Improving Verb Understanding in Video-Language Models, arXiv preprint. [pdf]
- CLIP-Event: Connecting Text and Images with Event Structures, in CVPR 2022. [pdf]
- Probing Image-Language Transformers for Verb Understanding, in ACL 2021 Findings. [pdf]
- ActBERT: Learning Global-Local Video-Text Representations, in CVPR 2020. [pdf]
- Multimodal Understanding and Reasoning for Role Labeling of Entities in Hateful Memes, in CONSTRAINT @ ACL 2022. [pdf]
- Learning To Recognize Procedural Activities with Distant Supervision, in CVPR 2022. [pdf]
- MERLOT Reserve: Neural Script Knowledge from Vision and Language and Sound, in CVPR 2022. [pdf]
- MERLOT: Multimodal Neural Script Knowledge Models, in NeurIPS 2021. [pdf]
- End-to-end Generative Pretraining for Multimodal Video Captioning, arXiv preprint. [pdf]
- Zero-Shot Anticipation for Instructional Activities, in ICCV 2019. [pdf]
- Language Is Not All You Need: Aligning Perception with Language Models, arXiv preprint. [pdf]
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners, arXiv preprint. [pdf]
- ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models, arXiv preprint. [pdf]
- UniT: Multimodal Multitask Learning with a Unified Transformer, arXiv preprint. [pdf]
- Visual Commonsense in Pretrained Unimodal and Multimodal Models, in NAACL 2022. [pdf]
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer, in NeurIPS 2021. [pdf]
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs, in CVPR 2021. [pdf]
- M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training, in CVPR 2021. [pdf]
- Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training, in NeurIPS 2021. [pdf]
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision, in EMNLP 2020. [pdf]