LongT5-based model pre-trained on a large amount of unlabeled Vietnamese news texts and fine-tuned with ViMS and VMDS collections



ViLongT5


A pretrained Transformer-based encoder-decoder model for multi-document text summarization in Vietnamese. The code is a framework-free implementation that combines flaxformer and t5x and is based purely on the JAX library.

ViLongT5 is trained on a large NewsCorpus of Vietnamese news texts. We benchmark ViLongT5 on multi-document text-summarization tasks: Abstractive Text Summarization and Named Entity Recognition. All experiments are described in our paper, Pre-training LongT5 for Vietnamese Mass-Media Multi-document Summarization Task.
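For intuition, multi-document summarization with an encoder-decoder model typically works by serializing several source articles into one long input sequence. The sketch below illustrates that idea only; the separator string and length cap are illustrative assumptions, not values taken from the ViLongT5 configs.

```python
# Hypothetical sketch of multi-document input preparation for a
# seq2seq summarizer. The separator and max_chars are assumptions,
# not the actual ViLongT5 preprocessing settings.

def build_model_input(documents, separator=" | ", max_chars=16384):
    """Join several news articles into a single source sequence."""
    joined = separator.join(doc.strip() for doc in documents)
    # LongT5-style models handle long inputs, but we still cap the length.
    return joined[:max_chars]

docs = [
    "First article about the event.",
    "Second article covering the same event.",
]
print(build_model_input(docs))
```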

Pretrained Models

Vocabulary: ViLongT5_vocab / training-script

| Model          | Gin File Location  | Checkpoint Location             |
| -------------- | ------------------ | ------------------------------- |
| ViLongT5-Large | ViLongT5_large.gin | ViLongt5-finetuned-large.tar.gz |

📄 Example scripts based on the Flaxformer library for the model: fine-tuning / inference / evaluation

Results

[Results figure: benchmark scores reported in the paper]

Datasets

Datasets used in the experiments: ViMS, VMDS, and VLSP.

Installation

NOTE: a GPU is assumed as the computational device. This project has been tested under the following configuration:

Local Installation

  • Initialize a virtual environment and install the project dependencies:
virtualenv env --python=/usr/bin/python3.9
source env/bin/activate
pip install -r dependencies.txt

Kaggle Installation

For testing under Kaggle, there is a separate tutorial.

Fine-tuning

We fine-tune the model on the training parts of the combined vims+vmds+vlsp collections as follows:

python -m t5x.train --gin_file="longt5_finetune_vims_vmds_vlsp_large.gin" --gin_search_paths='./configs'
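The gin file passed above is not reproduced in this README. For orientation, a minimal t5x fine-tuning gin fragment typically looks like the sketch below; every path and value here is an illustrative assumption, not the contents of longt5_finetune_vims_vmds_vlsp_large.gin.

```gin
# Illustrative only: NOT the actual longt5_finetune_vims_vmds_vlsp_large.gin.
include 't5x/configs/runs/finetune.gin'   # assumed base run config

TRAIN_STEPS = 100000                              # assumed value
INITIAL_CHECKPOINT_PATH = '/path/to/pretrained/checkpoint'  # placeholder
MODEL_DIR = '/path/to/output'                               # placeholder
```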

Inferring

Evaluation

Evaluation on vims+vmds+vlsp (test part) is as follows:

python -m t5x.eval --gin_file="longt5_eval_vims_vmds_vlsp_large.gin" --gin_search_paths='./configs'

Evaluation on vlsp (validation part) is as follows:

python -m t5x.eval --gin_file="configs/longt5_infer_vlsp_validation_large.gin" --gin_search_paths='./configs'
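Summarization output is commonly scored with ROUGE-style overlap metrics. As a rough illustration only (this is not the metric implementation used by t5x.eval or in the paper), unigram-overlap ROUGE-1 F1 can be computed like this:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between reference and candidate summaries."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each candidate token counts at most as often
    # as it appears in the reference.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat on the mat", "the cat sat"))  # ≈ 0.667
```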

References

@inproceedings{rusnachenko2023pretraining,
    title = "Pre-training {LongT5} for Vietnamese Mass-Media Multi-document Summarization Task",
    author = "Rusnachenko, Nicolay and Le, The Anh and Nguyen, Ngoc Diep",
    booktitle = "Proceedings of Artificial Intelligence and Natural Language",
    year = "2023"
}
