
How to Prepare a French-to-English Dataset for Machine Translation


Machine translation is the challenging task of converting text from a source language into coherent and matching text in a target language.

Neural machine translation systems such as encoder-decoder recurrent neural networks are achieving state-of-the-art results for machine translation with a single end-to-end system trained directly on source and target language.

Standard datasets are required to develop, explore, and become familiar with neural machine translation systems.

In this tutorial, you will discover the Europarl standard machine translation dataset and how to prepare the data for modeling.

After completing this tutorial, you will know:

  • The Europarl dataset comprises the proceedings of the European Parliament, translated into 11 languages.
  • How to load and clean the parallel French and English transcripts ready for modeling in a neural machine translation system.
  • How to reduce the vocabulary size of both French and English data in order to reduce the complexity of the translation task.

Let’s get started.

Tutorial Overview

This tutorial is divided into 5 parts; they are:

  • Europarl Machine Translation Dataset
  • Download French-English Dataset
  • Load Dataset
  • Clean Dataset
  • Reduce Vocabulary

Python Environment

This tutorial assumes you have a Python 3 SciPy environment installed. The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed, although the code below relies only on the Python standard library.

Europarl Machine Translation Dataset

Europarl is a standard dataset used for statistical machine translation and, more recently, neural machine translation.

It is comprised of the proceedings of the European Parliament, hence the name Europarl, a contraction of “European Parliament.”

The proceedings are the transcriptions of speakers at the European Parliament, which are translated into 11 different languages.

It is a collection of the proceedings of the European Parliament, dating back to 1996. Altogether, the corpus comprises of about 30 million words for each of the 11 official languages of the European Union

— Europarl: A Parallel Corpus for Statistical Machine Translation, 2005.

The raw data is available on the European Parliament website in HTML format.

The creation of the dataset was led by Philipp Koehn, author of the book “Statistical Machine Translation.”

The dataset was made available for free to researchers on the website “European Parliament Proceedings Parallel Corpus 1996-2011,” and often appears as a part of machine translation challenges, such as the Machine Translation task in the 2014 Workshop on Statistical Machine Translation.

The most recent version of the dataset is version 7, released in 2012, comprised of data from 1996 to 2011.

Download French-English Dataset

We will focus on the parallel French-English dataset.

This is a prepared corpus of aligned French and English sentences recorded between 1996 and 2011.

The dataset has the following statistics:

  • Sentences: 2,007,723
  • French words: 51,388,643
  • English words: 50,196,035

You can download the dataset from here:

  • Parallel corpus French-English (194 Megabytes): http://www.statmt.org/europarl/v7/fr-en.tgz

Once downloaded, you should have the file “fr-en.tgz” in your current working directory. Unpacking the archive produces two files:

  • English: europarl-v7.fr-en.en (288 Megabytes)
  • French: europarl-v7.fr-en.fr (331 Megabytes)
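
If you prefer to unpack the archive from Python rather than the command line, the short sketch below uses the standard library tarfile module; it assumes the downloaded “fr-en.tgz” is sitting in the current working directory.

import tarfile

# a minimal sketch: extract fr-en.tgz into the current working directory
# (running "tar -xzf fr-en.tgz" on the command line achieves the same result)
with tarfile.open('fr-en.tgz', 'r:gz') as archive:
	archive.extractall()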

The first five lines of the English file (europarl-v7.fr-en.en) look like this:

Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
You have requested a debate on this subject in the course of the next few days, during this part-session.
In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.

The corresponding five lines of the French file (europarl-v7.fr-en.fr) are:

Reprise de la session
Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.
Comme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit. En revanche, les citoyens d'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles.
Vous avez souhaité un débat à ce sujet dans les prochains jours, au cours de cette période de session.
En attendant, je souhaiterais, comme un certain nombre de collègues me l'ont demandé, que nous observions une minute de silence pour toutes les victimes, des tempêtes notamment, dans les différents pays de l'Union européenne qui ont été touchés.

Load Dataset

Let’s start off by loading the data files.

We can load each file as a string. Because the files contain unicode characters, we must specify an encoding when loading the files as text. In this case, we will use UTF-8, which handles the unicode characters in both files.

The function below, named load_doc(), will load a given file and return it as a blob of text; to_sentences() then splits the text into sentences (one per line), and sentence_lengths() reports the shortest and longest sentence lengths found.

def load_doc(filename):
	# open the file as read only
	file = open(filename, mode='rt', encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text
 
# split a loaded document into sentences
def to_sentences(doc):
	return doc.strip().split('\n')
 
# shortest and longest sentence lengths
def sentence_lengths(sentences):
	lengths = [len(s.split()) for s in sentences]
	return min(lengths), max(lengths)
 
# load English data
filename = 'europarl-v7.fr-en.en'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('English data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))
 
# load French data
filename = 'europarl-v7.fr-en.fr'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('French data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))
Running the example summarizes the number of sentences and the shortest and longest sentence lengths in each file:

English data: sentences=2007723, min=0, max=668
French data: sentences=2007723, min=0, max=693

Clean Dataset

The data needs some minimal cleaning before being used to train a neural translation model.

Based on a review of some samples of the text, the minimal cleaning we will perform includes:

  • Tokenizing text by white space.
  • Normalizing case to lowercase.
  • Removing punctuation from each word.
  • Removing non-printable characters.
  • Converting French characters to Latin characters.
  • Removing words that contain non-alphabetic characters.

The complete example below applies these operations to both the English and French data and saves the cleaned sentences to file using pickle.
import string
import re
from pickle import dump
from unicodedata import normalize

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, mode='rt', encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# split a loaded document into sentences
def to_sentences(doc):
	return doc.strip().split('\n')

# clean a list of lines
def clean_lines(lines):
	cleaned = list()
	# prepare regex for char filtering
	re_print = re.compile('[^%s]' % re.escape(string.printable))
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for line in lines:
		# normalize unicode characters
		line = normalize('NFD', line).encode('ascii', 'ignore')
		line = line.decode('UTF-8')
		# tokenize on white space
		line = line.split()
		# convert to lower case
		line = [word.lower() for word in line]
		# remove punctuation from each token
		line = [word.translate(table) for word in line]
		# remove non-printable chars from each token
		line = [re_print.sub('', w) for w in line]
		# remove tokens with numbers in them
		line = [word for word in line if word.isalpha()]
		# store as string
		cleaned.append(' '.join(line))
	return cleaned

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
	dump(sentences, open(filename, 'wb'))
	print('Saved: %s' % filename)

# load English data
filename = 'europarl-v7.fr-en.en'
doc = load_doc(filename)
sentences = to_sentences(doc)
sentences = clean_lines(sentences)
save_clean_sentences(sentences, 'english.pkl')
# spot check
for i in range(10):
	print(sentences[i])

# load French data
filename = 'europarl-v7.fr-en.fr'
doc = load_doc(filename)
sentences = to_sentences(doc)
sentences = clean_lines(sentences)
save_clean_sentences(sentences, 'french.pkl')
# spot check
for i in range(10):
	print(sentences[i])

Reduce Vocabulary

As part of the data cleaning, it is important to constrain the vocabulary of both the source and target languages.

The difficulty of the translation task grows with the size of the vocabularies, which in turn affects model training time and the size of the dataset required to fit the model.

In this section, we will reduce the vocabulary of both the English and French text and mark all out-of-vocabulary (OOV) words with a special token.

The complete example below loads the cleaned sentences saved in the previous step, restricts each vocabulary to words that occur at least 5 times, replaces all other words with the token “unk”, and saves the updated datasets to new files.

from pickle import load
from pickle import dump
from collections import Counter

# load a clean dataset
def load_clean_sentences(filename):
	return load(open(filename, 'rb'))

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
	dump(sentences, open(filename, 'wb'))
	print('Saved: %s' % filename)

# create a frequency table for all words
def to_vocab(lines):
	vocab = Counter()
	for line in lines:
		tokens = line.split()
		vocab.update(tokens)
	return vocab

# remove all words with a frequency below a threshold
def trim_vocab(vocab, min_occurrence):
	tokens = [k for k,c in vocab.items() if c >= min_occurrence]
	return set(tokens)

# mark all OOV words with "unk" for all lines
def update_dataset(lines, vocab):
	new_lines = list()
	for line in lines:
		new_tokens = list()
		for token in line.split():
			if token in vocab:
				new_tokens.append(token)
			else:
				new_tokens.append('unk')
		new_line = ' '.join(new_tokens)
		new_lines.append(new_line)
	return new_lines

# load English dataset
filename = 'english.pkl'
lines = load_clean_sentences(filename)
# calculate vocabulary
vocab = to_vocab(lines)
print('English Vocabulary: %d' % len(vocab))
# reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New English Vocabulary: %d' % len(vocab))
# mark out of vocabulary words
lines = update_dataset(lines, vocab)
# save updated dataset
filename = 'english_vocab.pkl'
save_clean_sentences(lines, filename)
# spot check
for i in range(10):
	print(lines[i])

# load French dataset
filename = 'french.pkl'
lines = load_clean_sentences(filename)
# calculate vocabulary
vocab = to_vocab(lines)
print('French Vocabulary: %d' % len(vocab))
# reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New French Vocabulary: %d' % len(vocab))
# mark out of vocabulary words
lines = update_dataset(lines, vocab)
# save updated dataset
filename = 'french_vocab.pkl'
save_clean_sentences(lines, filename)
# spot check
for i in range(10):
	print(lines[i])


Running the example reports the size of each vocabulary before and after trimming:

English Vocabulary: 105357
New English Vocabulary: 41746
Saved: english_vocab.pkl
French Vocabulary: 141642
New French Vocabulary: 58800
Saved: french_vocab.pkl
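
Although pairing the prepared files is beyond the preparation steps above, a quick sanity check is to load the two pickled lists and confirm they remain aligned line for line; the sketch below assumes this is how you will consume english_vocab.pkl and french_vocab.pkl downstream.

from pickle import load

# load the prepared English and French sentence lists
english = load(open('english_vocab.pkl', 'rb'))
french = load(open('french_vocab.pkl', 'rb'))

# the two lists are line-aligned, so they can be paired directly
assert len(english) == len(french)
for en, fr in list(zip(english, french))[:3]:
	print('[%s] => [%s]' % (en, fr))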

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

  • Europarl: A Parallel Corpus for Statistical Machine Translation, 2005.
  • European Parliament Proceedings Parallel Corpus 1996-2011 Homepage
  • Europarl Corpus on Wikipedia

Summary

In this tutorial, you discovered the Europarl machine translation dataset and how to prepare the data for modeling.

Specifically, you learned:

  • The Europarl dataset comprises the proceedings of the European Parliament, translated into 11 languages.
  • How to load and clean the parallel French and English transcripts ready for modeling in a neural machine translation system.
  • How to reduce the vocabulary size of both French and English data in order to reduce the complexity of the translation task.