Models to perform neural summarization (extractive and abstractive) using machine learning transformers and a tool to convert abstractive summarization datasets to the extractive task.
TransformerSum is a library that aims to make it easy to train, evaluate, and use machine learning transformer models that perform automatic summarization. It features tight integration with huggingface/transformers, which enables the easy usage of a wide variety of architectures and pre-trained models. There is a heavy emphasis on code readability and interpretability so that both beginners and experts can build new components. Both the extractive and abstractive model classes are written using pytorch_lightning, which handles the PyTorch training loop logic and enables easy usage of advanced features such as 16-bit precision and multi-GPU training.
TransformerSum supports both the extractive and abstractive summarization of long sequences (4,096 to 16,384 tokens) using the longformer (extractive) and LongformerEncoderDecoder (abstractive), which is a combination of BART and the longformer.
TransformerSum also contains models that can run on resource-limited devices while still maintaining high levels of accuracy. Models are automatically evaluated with the ROUGE metric, but human tests can be conducted by the user.
Check out the documentation for usage details.
- For extractive summarization, compatible with every huggingface/transformers transformer encoder model.
- For abstractive summarization, compatible with every huggingface/transformers EncoderDecoder and Seq2Seq model.
- Currently, 10+ pre-trained extractive models are available to summarize text, trained on 3 datasets (CNN-DM, WikiHow, and arXiv-PubMed).
- Contains pre-trained models that excel at summarization on resource-limited devices: On CNN-DM, mobilebert-uncased-ext-sum achieves about 97% of the performance of BertSum while containing 4.45 times fewer parameters. It achieves about 94% of the performance of MatchSum (Zhong et al., 2020), the current extractive state-of-the-art.
- Contains code to train models that excel at summarizing long sequences: The longformer (extractive) and LongformerEncoderDecoder (abstractive) can summarize sequences of lengths up to 4,096 tokens by default, but can be trained to summarize sequences of more than 16k tokens.
- Integration with huggingface/nlp means any summarization dataset in the nlp library can be used for both abstractive and extractive training.
- "Smart batching" (extractive) and trimming (abstractive) support to avoid unnecessary calculations (speeds up training).
- Use of pytorch_lightning for code readability.
- Extensive documentation.
- Three pooling modes (convert word vectors to sentence embeddings): mean or max of word embeddings in addition to the CLS token.
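The "smart batching" feature listed above can be illustrated with a small sketch. This shows one common reading of the idea (group sequences of similar length and pad each batch only to its own longest member); it is not taken from the TransformerSum source, and the function names `smart_batches` and `padded_size` are hypothetical.

```python
from typing import List


def smart_batches(
    sequences: List[List[int]], batch_size: int, pad_id: int = 0
) -> List[List[List[int]]]:
    """Group token-id sequences by length and pad each batch to its own maximum."""
    # Sorting by length places similarly sized sequences in the same batch.
    ordered = sorted(sequences, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        width = len(chunk[-1])  # the last item is the longest because of the sort
        batches.append([seq + [pad_id] * (width - len(seq)) for seq in chunk])
    return batches


def padded_size(batches: List[List[List[int]]]) -> int:
    """Count every token slot (real or padding) the model would process."""
    return sum(len(row) for batch in batches for row in batch)


# Four toy sequences of lengths 2, 9, 3, and 8 (token ids are arbitrary).
seqs = [[1] * 2, [1] * 9, [1] * 3, [1] * 8]
naive_slots = len(seqs) * max(len(s) for s in seqs)  # pad everything to length 9
smart_slots = padded_size(smart_batches(seqs, batch_size=2))
print(naive_slots, smart_slots)  # 36 24
```

In this toy case a third of the token slots disappear, which is where the training speed-up would come from under these assumptions.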
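As a rough illustration of the three pooling modes listed above, the sketch below collapses the token vectors of one sentence into a single sentence embedding. It uses plain Python lists rather than tensors, the mode names `"cls"`, `"mean"`, and `"max"` and the function `pool_sentence` are hypothetical, and it assumes the first vector of each sentence belongs to that sentence's CLS token.

```python
from typing import List, Sequence

Vector = List[float]


def pool_sentence(token_vectors: Sequence[Vector], mode: str = "cls") -> Vector:
    """Collapse one sentence's token vectors into a single sentence embedding."""
    if not token_vectors:
        raise ValueError("a sentence needs at least one token vector")
    if mode == "cls":
        # Assumption: the first vector is the sentence's leading CLS token.
        return list(token_vectors[0])
    columns = list(zip(*token_vectors))  # one tuple per embedding dimension
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode!r}")


# Three tokens with two-dimensional embeddings.
sentence = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
print(pool_sentence(sentence, "cls"))   # [1.0, 4.0]
print(pool_sentence(sentence, "mean"))  # [2.0, 2.0]
print(pool_sentence(sentence, "max"))   # [3.0, 4.0]
```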
All pre-trained models (including larger models and other architectures) are located in the documentation. The models below are only a fraction of those available.
Name | Dataset | Comments | R1/R2/RL/RL-Sum | Model Download | Data Download |
---|---|---|---|---|---|
mobilebert-uncased-ext-sum | CNN/DM | None | 42.01/19.31/26.89/38.53 | Model | CNN/DM Bert Uncased |
distilroberta-base-ext-sum | CNN/DM | None | 42.87/20.02/27.46/39.31 | Model | CNN/DM Roberta |
roberta-base-ext-sum | CNN/DM | None | 43.24/20.36/27.64/39.65 | Model | CNN/DM Roberta |
mobilebert-uncased-ext-sum | WikiHow | None | 30.72/8.78/19.18/28.59 | Model | WikiHow Bert Uncased |
distilroberta-base-ext-sum | WikiHow | None | 31.07/8.96/19.34/28.95 | Model | WikiHow Roberta |
roberta-base-ext-sum | WikiHow | None | 31.26/9.09/19.47/29.14 | Model | WikiHow Roberta |
mobilebert-uncased-ext-sum | arXiv-PubMed | None | 33.97/11.74/19.63/30.19 | Model | arXiv-PubMed Bert Uncased |
distilroberta-base-ext-sum | arXiv-PubMed | None | 34.70/12.16/19.52/30.82 | Model | arXiv-PubMed Roberta |
roberta-base-ext-sum | arXiv-PubMed | None | 34.81/12.26/19.65/30.91 | Model | arXiv-PubMed Roberta |
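To give a feel for what the R1 and R2 columns measure, here is a minimal sketch of ROUGE-N as an F1 score over overlapping n-grams. It is a simplification: it splits on whitespace, applies no stemming, and returns a value between 0 and 1, whereas the table's scores appear to be on a 0 to 100 scale. The function names are hypothetical, and this is not the evaluation code the library uses.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """F1 of n-gram overlap between a candidate summary and a reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
r1 = rouge_n_f1(candidate, reference, n=1)  # 5 of 6 unigrams match
r2 = rouge_n_f1(candidate, reference, n=2)  # 3 of 5 bigrams match
print(round(r1, 4), round(r2, 4))  # 0.8333 0.6
```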
Name | Dataset | Comments | Model Download |
---|---|---|---|
longformer-encdec-8192-bart-large-abs-sum | arXiv-PubMed | None | Not yet... |
Installation is made easy due to conda environments. Simply run this command from the root project directory: conda env create --file environment.yml and conda will create an environment called transformersum with all the required packages from environment.yml. The spacy en_core_web_sm model is required for the convert_to_extractive.py script to detect sentence boundaries.
- Clone this repository: git clone https://github.com/HHousen/transformersum.git
- Change to project directory: cd transformersum
- Run installation command: conda env create --file environment.yml
- (Optional) If using the convert_to_extractive.py script, download the en_core_web_sm spacy model: python -m spacy download en_core_web_sm
Hayden Housen – haydenhousen.com
Distributed under the GNU General Public License v3.0. See the LICENSE for more information.
- Code heavily inspired by the following projects:
  - Adapting BERT for Extractive Summarization: BertSum
  - Text Summarization with Pretrained Encoders: PreSumm
  - Word/Sentence Embeddings: sentence-transformers
  - CNN/DM Dataset: cnn-dailymail
  - PyTorch Lightning Classifier: lightning-text-classification
- Important projects utilized:
  - PyTorch: pytorch
  - Training code: pytorch_lightning
  - Transformer Models: huggingface/transformers
All pull requests are welcome.
Questions? Comments? Issues? Don't hesitate to open an issue and briefly describe what you are experiencing (with any error logs if necessary). Thanks.
- Fork it (https://github.com/HHousen/TransformerSum/fork)
- Create your feature branch (git checkout -b feature/fooBar)
- Commit your changes (git commit -am 'Add some fooBar')
- Push to the branch (git push origin feature/fooBar)
- Create a new Pull Request