The data are used in our AAAI-18 paper Translating Pro-Drop Languages with Reconstruction Models.
The corpus is a dialogue-domain parallel corpus with larger-context information, built for research purposes. More than two million sentence pairs were extracted from the subtitles of television episodes.
Within the corpus, sentences are generally short and the Chinese side contains many examples of dropped pronouns (DPs). The corpus was therefore initially designed for the pro-drop language translation task, and the related paper (Translating Pro-Drop Languages with Reconstruction Models) was accepted at AAAI 2018.
The corpus can also be used for various other translation tasks, such as larger-context MT (Exploiting Cross-Sentence Context for Neural Machine Translation; Learning to Remember Translation History with a Continuous Cache).
The differences from other existing bilingual subtitle corpora are as follows:
- We only extract subtitles of television episodes instead of movies. The vocabulary in movies is sparser than that in TV series, so we use TV series data to avoid long-tail problems in MT tasks.
- We pre-processed the extracted data using a number of in-house scripts, including sentence boundary detection and bilingual sentence alignment. This yields a cleaner, better-aligned, higher-quality corpus.
- We keep the larger-context information instead of shuffling sentences, so you can mine useful discourse information from the previous or following sentences for MT (see the sketch after this list).
- We randomly select two complete television episodes as the tuning set and another two episodes as the test set, and we manually create multiple references for them (see the evaluation sketch after the folder structure below).
- To make it easy to re-implement our AAAI-18 paper (Translating Pro-Drop Languages with Reconstruction Models), we also release the +DP corpus, in which the Chinese sentences are automatically labelled with DPs using alignment information.
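Because the corpus preserves the original subtitle order, cross-sentence context can be recovered directly from the files. Below is a minimal Python sketch of one way to do this; the paths and file names (train.zh, train.en) are assumptions for illustration, not the actual names shipped in the repo.

```python
# A minimal sketch: pair each sentence with its k previous source sentences,
# relying on the corpus preserving the original subtitle order.
# The paths and file names below (train.zh / train.en) are assumptions made
# for illustration -- adjust them to the files actually in the repo.
from pathlib import Path

def load_parallel(src_path, tgt_path):
    """Read two line-aligned files into a list of (source, target) pairs."""
    with open(src_path, encoding="utf-8") as f_src, \
         open(tgt_path, encoding="utf-8") as f_tgt:
        return [(s.strip(), t.strip()) for s, t in zip(f_src, f_tgt)]

def with_context(pairs, k=2):
    """Yield each pair together with the k preceding source sentences."""
    for i, (src, tgt) in enumerate(pairs):
        context = [p[0] for p in pairs[max(0, i - k):i]]
        yield {"context": context, "source": src, "target": tgt}

root = Path("data/preprocessed corpus/train")   # hypothetical layout
pairs = load_parallel(root / "train.zh", root / "train.en")
for example in with_context(pairs, k=2):
    pass  # feed context + source into a larger-context NMT model
```

Note that a real loader should also reset the context at episode boundaries, if your copy of the data marks them.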
Please clone the repo, as we may release updated versions of the data in the future.
git clone https://github.com/longyuewangdcu/tvsub.git
The folder structure is as follows:
++ tvsub (root)
++++ data
++++++ original corpus
++++++++ train
++++++++ dev
++++++++ test
++++++ preprocessed corpus
++++++++ train
++++++++ dev
++++++++ test
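Since the dev and test sets come with multiple manually created references, scoring should feed all references to the metric at once. Here is a minimal sketch using sacrebleu's multi-reference API; the hypothesis and reference file names (test.out, test.en0, test.en1) are hypothetical placeholders.

```python
# A minimal sketch of multi-reference BLEU with sacrebleu. The file names
# below (test.out, test.en0, test.en1) are hypothetical placeholders.
from sacrebleu.metrics import BLEU

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

hyps = read_lines("test.out")  # one system translation per line
refs = [read_lines("data/preprocessed corpus/test/test.en0"),
        read_lines("data/preprocessed corpus/test/test.en1")]  # one stream per reference

print(BLEU().corpus_score(hyps, refs))
```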
The following table lists the statistics of the corpus.
- Longyue Wang - crawling and pre-processing data
- Zhaopeng Tu - dev and test sets
If you use the data, please cite the following paper:
Longyue Wang, Zhaopeng Tu, Shuming Shi, Tong Zhang, Yvette Graham, Qun Liu. (2018). "Translating Pro-Drop Languages with Reconstruction Models". In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI 2018).
@inproceedings{wang2018aaai,
  title={Translating Pro-Drop Languages with Reconstruction Models},
  author={Wang, Longyue and Tu, Zhaopeng and Shi, Shuming and Zhang, Tong and Graham, Yvette and Liu, Qun},
  booktitle={Proceedings of the Thirty-Second {AAAI} Conference on Artificial Intelligence},
  publisher={{AAAI} Press},
  address={New Orleans, Louisiana, USA},
  pages={1--9},
  year={2018}
}
The data were crawled from the subtitle websites http://assrt.net and http://www.zimuzu.tv. If you use the TVsub corpus, please add these two links to your website and publications!
The data may only be used for research purposes.
Please read the License Agreement before you use the data.
The released data is part of the contribution of our AAAI-18 paper.
The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. Work was done when Longyue Wang was interning at Tencent AI Lab.