Knowledge-Driven Vision-Language Pretraining Reading

This repository contains a list of papers, code, datasets, and leaderboards on the topic of Knowledge-Driven Vision-Language Pretraining. If you find any errors, please don't hesitate to open an issue or a pull request.

We will continue to add and update related papers and code on this page (last updated June 8, 2022).

Basic vision-language pretraining papers and code

(For new learners, some foundational papers for general vision-language pretraining.)

  • [CLIP] Learning Transferable Visual Models From Natural Language Supervision, in ICML 2021. [pdf] [code] (see the contrastive-loss sketch after this list)

  • [Transformer] Attention Is All You Need, in NeurIPS 2017. [pdf]
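For orientation, the core of CLIP-style pretraining is a symmetric contrastive (InfoNCE) objective: matched image-text pairs in a batch are pulled together and all mismatched pairings are pushed apart. Below is a minimal PyTorch sketch of that loss, not the authors' implementation; it assumes the image and text encoders have already produced [batch, dim] embeddings.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity logits: [batch, batch]; matched pairs lie on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, labels)      # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), labels)  # text -> matching image
    return (loss_i2t + loss_t2i) / 2
```

Note that in the CLIP paper the temperature is a learned parameter (a trainable logit scale) rather than the fixed constant used here.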

Vision-language pretraining: entity knowledge

  • Video-and-Language Pre-training with Entity Prompts, in CVPR 2022. [pdf]

  • KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation, in NAACL 2022 Findings. [pdf]

  • VinVL: Revisiting Visual Representations in Vision-Language Models, in CVPR 2021. [pdf]

  • E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning, in ACL 2021. [pdf]

  • Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, in ECCV 2020. [pdf]

  • UNITER: Learning Universal Image-Text Representations, in ECCV 2020. [pdf]

  • VL-BERT: Pre-training of Generic Visual-Linguistic Representations, in ICLR 2020. [pdf]

  • Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, in AAAI 2020. [pdf]

  • Unified Vision-Language Pre-training for Image Captioning and VQA, in AAAI 2020. [pdf]

  • ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, in NeurIPS 2019. [pdf]

  • LXMERT: Learning Cross-Modality Encoder Representations from Transformers, in EMNLP 2019. [pdf]

  • Fusion of Detected Objects in Text for Visual Question Answering, in EMNLP 2019. [pdf]

  • What value do explicit high level concepts have in vision to language problems?, in CVPR 2016. [pdf]

  • Image Captioning with Semantic Attention, in CVPR 2016. [pdf]

Vision-language pretraining: relational knowledge

  • ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graphs, in AAAI 2021. [pdf] (see the sketch after this list)

  • MUREL: Multimodal Relational Reasoning for Visual Question Answering, in CVPR 2019. [pdf]

  • Relation-Aware Graph Attention Network for Visual Question Answering, in ICCV 2019. [pdf]

  • Learning Conditioned Graph Structures for Interpretable Visual Question Answering, in NeurIPS 2018. [pdf]
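To make the relational-knowledge idea concrete: ERNIE-ViL, for example, constructs masked-prediction tasks over scene-graph nodes (objects, attributes, relationships) parsed from the caption, instead of masking tokens uniformly at random. Below is a rough, hypothetical sketch of that node-selection step; the triple format and function names are assumptions for illustration, not the paper's code.

```python
import random
from typing import List, Tuple

# A parsed triple: (subject_span, relation_span, object_span),
# each span being (start, end) token indices into the caption.
Span = Tuple[int, int]
Triple = Tuple[Span, Span, Span]

def scene_graph_mask_positions(triples: List[Triple],
                               mask_prob: float = 0.3) -> List[int]:
    """Pick caption token positions to mask, biased toward scene-graph
    nodes so the model must recover objects/relations from the image."""
    positions = set()
    for subj, rel, obj in triples:
        for start, end in (subj, rel, obj):
            if random.random() < mask_prob:
                positions.update(range(start, end))  # mask the whole node span
    return sorted(positions)

# Example caption: "a black dog chases a ball"
# One triple: dog --chases--> ball, with (start, end) token spans.
triples = [((2, 3), (3, 4), (5, 6))]
print(scene_graph_mask_positions(triples, mask_prob=1.0))  # [2, 3, 5]
```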

Vision-language pretraining: event knowledge

  • Verbs in Action: Improving Verb Understanding in Video-Language Models, in arXiv. [pdf]

  • CLIP-Event: Connecting Text and Images with Event Structures, in CVPR 2022. [pdf] (see the sketch after this list)

  • Probing Image-Language Transformers for Verb Understanding, in ACL 2021 Findings. [pdf]

  • ActBERT: Learning Global-Local Video-Text Representations, in CVPR 2020. [pdf]

  • Multimodal Understanding and Reasoning for Role Labeling of Entities in Hateful Memes, in CONSTRAINT@ACL 2022. [pdf]
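A recurring recipe in this group, used by CLIP-Event among others, is to verbalize extracted event structures (a verb plus its argument roles) into caption-like text that a contrastive model can align with the image, and to build hard negatives by corrupting the structure. The sketch below illustrates the general idea only; the Event schema, field names, and template are hypothetical, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class Event:
    verb: str     # e.g. "arrest" (hypothetical schema for illustration)
    agent: str    # Arg0: who performs the action
    patient: str  # Arg1: who/what the action is done to

def verbalize(event: Event) -> str:
    """Render an event structure as a caption-style prompt."""
    return f"a photo of {event.agent} {event.verb}ing {event.patient}"

def swap_roles(event: Event) -> Event:
    """Swap agent and patient to create a hard negative description."""
    return Event(verb=event.verb, agent=event.patient, patient=event.agent)

e = Event(verb="arrest", agent="the police", patient="a protester")
print(verbalize(e))              # a photo of the police arresting a protester
print(verbalize(swap_roles(e)))  # a photo of a protester arresting the police
```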

Vision-language pretraining: procedural knowledge

  • Learning To Recognize Procedural Activities with Distant Supervision, in CVPR 2022. [pdf]

  • MERLOT Reserve: Neural Script Knowledge from Vision and Language and Sound, in CVPR 2022. [pdf]

  • MERLOT: Multimodal Neural Script Knowledge Models, in NeurIPS 2021. [pdf]

  • End-to-end Generative Pretraining for Multimodal Video Captioning, in arXiv. [pdf]

  • Zero-Shot Anticipation for Instructional Activities, in ICCV 2019. [pdf]

Vision-language pretraining: Language Model (LM) knowledge

  • Language Is Not All You Need: Aligning Perception with Language Models, in arXiv. [pdf]

  • Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners, in arXiv. [pdf]

  • ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models, in arXiv. [pdf]

  • UniT: Multimodal Multitask Learning with a Unified Transformer, in arXiv. [pdf]

  • Visual Commonsense in Pretrained Unimodal and Multimodal Models, in NAACL 2022. [pdf]

  • VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer, in NeurIPS 2021. [pdf]

  • VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs, in CVPR 2021. [pdf]

  • M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training, in CVPR 2021. [pdf]

  • Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training, in NeurIPS 2021. [pdf]

  • Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision, in EMNLP 2020. [pdf]

Vision-language pretraining: external knowledge

  • KRIT: Knowledge-Reasoning Intelligence in Vision-Language Transformer. [pdf]

  • KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning, in Knowledge-Based Systems 2022. [pdf]
