This repository collects basic concepts of Natural Language Processing, reputable textbooks and blogs, popular papers, and more.
This is also the Natural Language Processing part of Machine Learning Resources created by a group of people including jindongwang.
Contributors are welcome to work together and make it BETTER!
- Linear Algebra
- Matrix Analysis
- Convex Optimization
- The Elements of Statistical Learning (ESL) - Hastie, Tibshirani, Friedman
- CS228 Probabilistic Graphical Model - Stanford
- 10708 Probabilistic Graphical Model - CMU
- Deep Learning - Ian Goodfellow, Yoshua Bengio, Aaron Courville
- CS231n Convolutional Neural Networks for Visual Recognition - Stanford
- Foundations of Statistical Natural Language Processing - Chris Manning
- Speech and Language Processing - Daniel Jurafsky and James H. Martin
- Statistical Learning Methods (统计学习方法) - Hang Li (李航)
- Advanced Natural Language Processing - MIT
- CS 224n Natural Language Processing with Deep Learning - Stanford
- Deep Learning for NLP at Oxford with Deepmind - Oxford
- 11-747 NN4NLP
- 11-737 Multilingual NLP
- Some Knowledge about Machine Learning
- A list of datasets
- Probabilistic Graphical Model
- Hidden Markov Model
- Conditional Random Fields
- Topic Model
- Latent Dirichlet Allocation(paper)
- Deep Learning Model
- Long Short Term Memory(LSTM) Sepp Hochreiter, 1997
- Interpretation Omer Levy, UWashington, 2018
- Recurrent Neural Network - Seq2Seq (TensorFlow Tutorial) - Machine Translation TensorFlow implementation
- Convolutional Neural Network
- Attention Model
- Overview(Chinese)
- Generative Adversarial Network (GAN)
- Transformer
- Training Tips
- Bidirectional Encoder Representations from Transformers (BERT) Jacob Devlin, Google 2018
- Long Short Term Memory(LSTM) Sepp Hochreiter, 1997
- TensorFlow implementation of RNN and undocumented features
- The Unreasonable Effectiveness of Recurrent Neural Networks
Areas are categorized by the tracks of ACL 2018, ACL 2020, and EMNLP 2020.
- Task
- Summarization
- Opinion Summarization
- Evaluation
- Model
- Extractive
- Generative
- Hybrid
- Dataset
- XSum, EMNLP2018 [paper]
- CNN/DailyMail
- NEWSROOM
- Multi-News
- Gigaword
- arXiv
- PubMed
- BIGPATENT
- WikiHow
- Reddit TIFU (long, short)
- AESLC
- BillSum
- Model
- Word2Vec
- Pre-trained Embedding
- GloVe
- word2vec
- FastText
- Contextual Word Embedding
- ELMo
- GPT
- BERT
- XLNet
- BART
- T5
- Task
- Word Segmentation
- Syntactic Parsing
- Model
- Hidden Markov Model (HMM)
- Conditional Random Fields (CRFs)
- Finetuned Language Models
- Task
- Constituency Parsing
- Dependency Parsing
- Visually Grounded Syntactic Acquisition
- Model
- Dataset
- Tasks
- Semantic Parsing
- AMR-to-text
- Text-to-AMR
- Table-to-text
- Code Generation
- Semantic Parsing
- Model
- Dataset
- Tasks
- Word Sense Disambiguation
- Tasks
- Topic Extraction
- Sentiment Extraction
- Aspect Extraction
- Task
- Machine Translation
- Non-autoregressive Machine Translation
- Word-alignment
- Model
- Dataset
- WMT
- Task
- SPAM Classification
- Sentiment Analysis
- Model
- Dataset
- Task
- Dataset
- CNN/DailyMail
- SQuAD
- Benchmark: F1-86.967 BERT + Synthetic Self-Training (ensemble) Jan 10, 2019
- RACE
- Benchmark: RACE-83.2 RACE-M-86.5 RACE-H-81.3 RoBERTa July 2019
- Task
- Code-Switching
- Multilingual Translation
- Model
- Dataset
- Tasks
- Model
- N-gram
- ELMo, NAACL2018
- GPT
- GPT-2, arXiv2019
- GPT-3, NeurIPS2020
- BERT, NAACL2019
- RoBERTa, arXiv 2019
- SpanBERT, TACL 2020
- Efficient
- Domain Specific
- Language Specific [Latin BERT, German BERT, Italian BERT, Chinese BERT]
- BERTology, TACL 2020
- XLNet, NeurIPS2019
- MASS, ICML2019 [code]
- ELECTRA, ICLR2020 [code]
- T5, JMLR2020
- BART, ACL2020
- Finetuning
- Invasive (LM not fixed)
- Regular finetuning
- Re-initialization for few-shot learning, ICLR2021
- Non-invasive (LM fixed)
- Prefix-tuning, arXiv2021
- Invasive (LM not fixed)
- Language Model as
- BERTScore, ICLR2020
- Few-shot learner
- Bias in few-shot examples, arXiv2021
- Knowledge base EMNLP2019, Tutorial@AAAI2021
- Dataset
- CommonCrawl
- Wiki-Text
- STORIES
- C4 [huggingface]
- Tasks
- Fact Verification
- Commonsense Reasoning
- Word-level Rationales
- Factually Consistent Generation
- Model
- Dataset
- Tasks
- Grammatical Error Correction (GEC) [BEA@NAACL2018, BEA@ACL2019, BEA@ACL2020, BEA@EACL2021]
- Lexical Substitution
- Lexical Simplification
- Model
- Dataset
- Huggingface Dataset
- GLUE
- SuperGLUE
- Leaderboards
- Machine Learning Package and Framework
- scikit-learn
- TensorFlow
- Caffe2
- PyTorch
- MXNet
- NLTK
- gensim
- jieba
- Stanford NLP
- Transformers (huggingface)
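If you are interested in this project, you are very welcome to join!
- Regular participation: simply fork the repository and open a pull request.
- If you want to upload a file: please do not upload it directly into the project, because that makes the git repository too large. The correct approach is to upload a hyperlink to it. If the file is already available online (papers, for example, all have links), upload the link directly; if it is a file or dataset of your own that you want to share, then, given the state of cloud storage services in China, please upload it as follows: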
Learn how to collaborate through GitHub
Keep your forked project up to date