cisnlp/parcoure

ParCourE

This repository contains the code for ParCourE, the Parallel Corpus Explorer. It is a web app for browsing a word-aligned multiparallel corpus. You can view one instance of ParCourE, which runs on a word-aligned version of the Parallel Bible Corpus by Mayer and Cysouw (2014), here.

Setup

This guide shows how to set up ParCourE for a parallel corpus. We will download a parallel corpus in XCES format, more specifically a small version of the Bible corpus from OPUS, and set up ParCourE for it.

1. Environment

  • With Anaconda, create an environment that has the required dependencies using the following command: conda env create --file dependencies.yaml

  • Switch to the newly created environment: conda activate parcoure

If you don't use Anaconda, you will have to install the dependencies listed in the dependencies.yaml file in your environment of choice.

2. Download Corpus

Download the following files from the OPUS website and extract them. Alternatively, you can download any corpora you like, in the languages of your choice.

After extraction, put the language-specific data files and the inter-language alignment files in one directory. In this example we put the files English.xml, English-WEB.xml, Farsi.xml, German.xml, en-pes.xml, de-en.xml and de-pes.xml in a directory called CES_Corpus.
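To sanity-check an inter-language alignment file before handing it to ParCourE, the sentence links can be read with a few lines of Python. This is a minimal sketch, not part of ParCourE: it assumes the usual OPUS XCES layout, in which each linkGrp element holds link elements whose xtargets attribute lists source and target sentence ids separated by a semicolon. The function name read_sentence_links is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_sentence_links(xml_text):
    """Return a list of (source_ids, target_ids) pairs from XCES alignment XML.

    Assumption: the file follows the common OPUS layout
    <cesAlign><linkGrp ...><link xtargets="src ids;trg ids"/></linkGrp></cesAlign>.
    """
    root = ET.fromstring(xml_text)
    pairs = []
    for group in root.iter("linkGrp"):
        for link in group.iter("link"):
            # xtargets holds space-separated ids on each side of the ';'.
            source, _, target = link.get("xtargets", "").partition(";")
            pairs.append((source.split(), target.split()))
    return pairs
```

For a file on disk, read its content first (for example with open("CES_Corpus/de-en.xml", encoding="utf-8")) and pass the text to the function. An empty result usually means the file is not a sentence alignment file.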

3. Elasticsearch

Install Elasticsearch from here and start the server. Then add its address to the config file (see below). Elasticsearch uses port 9200 by default. If you change the port, you also have to change it in the config file. Also make sure that Elasticsearch is reachable from the machine that runs ParCourE.

Check whether Elasticsearch is accessible (use your IP address instead of localhost if you have installed Elasticsearch on another server):

$> curl localhost:9200

{
  "name" : "hostName",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "-IGDqOQOSwWnVY-RVHSedg",
  "version" : {
    "number" : "7.9.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "dsdfsac4e49417f2da2f244e3e97b4e6e",
    "build_date" : "2019-08-11T00:44:31.62642",
    "build_snapshot" : false,
    "lucene_version" : "5.4.2",
    "minimum_wire_compatibility_version" : "11.8.0",
    "minimum_index_compatibility_version" : "4.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
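The same check can be scripted, for example to run it before the preparation step. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and it assumes an unsecured HTTP endpoint like the one in the curl example above; a server with authentication or TLS enabled would need extra handling.

```python
import json
import urllib.request
from urllib.error import URLError


def parse_es_info(body):
    """Return the version number from an Elasticsearch root-endpoint reply."""
    info = json.loads(body)
    # A healthy node answers with a JSON object that has a "version" block.
    return info["version"]["number"]


def elasticsearch_version(address="localhost:9200", timeout=5.0):
    """Return the server version, or None if the server cannot be reached
    or the reply does not look like an Elasticsearch reply."""
    try:
        with urllib.request.urlopen(f"http://{address}", timeout=timeout) as reply:
            return parse_es_info(reply.read().decode("utf-8"))
    except (URLError, OSError, KeyError, ValueError):
        return None
```

Calling elasticsearch_version("localhost:9200") returns a version string when the server answers and None otherwise.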

4. Word Aligner

Install SimAlign (it is required for the input alignment page to work): pip install --upgrade git+https://github.com/cisnlp/simalign.git#egg=simalign

To align your parallel corpus you can use any word alignment tool you prefer. Here we used SimAlign for as many languages as possible and eflomal for the remaining languages.

Some popular word aligners are:

Currently, to use aligners other than fast_align and eflomal, you have to extract the word alignments yourself and put them in alignments_dir.
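When you extract alignments yourself, it helps to be able to read and write them programmatically. fast_align and eflomal write one line per sentence pair in the "i-j" (Pharaoh) format, where i and j are 0-based token positions in the source and target sentence. The sketch below converts between that text format and Python tuples. Whether the files in alignments_dir must use exactly this format is an assumption that this README does not confirm, and the function names are hypothetical.

```python
def parse_pharaoh(line):
    """Parse one Pharaoh-format line such as '0-0 1-2' into index pairs."""
    pairs = []
    for token in line.split():
        source, _, target = token.partition("-")
        pairs.append((int(source), int(target)))
    return pairs


def to_pharaoh(pairs):
    """Serialize (source_index, target_index) pairs, sorted, as one line."""
    return " ".join(f"{source}-{target}" for source, target in sorted(pairs))
```

For example, parse_pharaoh("0-0 1-2 2-1") yields the pairs (0, 0), (1, 2) and (2, 1), and to_pharaoh turns such pairs back into a single line.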

5. Configurations

Set the following in the file config.ini:

  • ces_corpus_dir: The directory that contains the downloaded corpus files in CES format. For the toy example it contains the following files, taken from the archives extracted above: de-en.xml de-pes.xml en-pes.xml English-WEB.xml English.xml Farsi.xml German.xml
  • ces_alignment_files: A comma-separated list of the files that contain sentence alignments. In our case it is de-pes.xml,de-en.xml,en-pes.xml
  • parcoure_data_dir: An ABSOLUTE directory path where ParCourE can keep its data and configuration files.
  • elasticsearch_address: IP address and port of Elasticsearch.
  • fast_align_path: The path to your fast_align build, something like "/my_installation_path/fast_align/build/". ParCourE uses fast_align for word alignment by default, unless extra_aligner_path is set.
  • extra_aligner_path: (optional) Something like "/my_installation_path/eflomal/". If you set it, ParCourE uses this aligner instead of fast_align to extract word alignments.
  • worker_count: The number of CPU cores ParCourE may use to extract word alignments. A higher number speeds up word alignment extraction during setup.
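A small script can catch common mistakes in config.ini before the long-running preparation step starts. The following is a minimal sketch, not part of ParCourE. It assumes that config.ini has at least one section header (the section name is not given here, so all sections are merged), and that every key above except extra_aligner_path must be set. The function name check_config is hypothetical.

```python
import configparser
import os

# Assumption: all documented keys except extra_aligner_path are required.
REQUIRED_KEYS = (
    "ces_corpus_dir",
    "ces_alignment_files",
    "parcoure_data_dir",
    "elasticsearch_address",
    "fast_align_path",
    "worker_count",
)


def check_config(text):
    """Return a list of problems found in the text of a config.ini file."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    values = {}
    for section in parser.sections():
        values.update(parser[section])
    problems = [
        f"missing key: {key}"
        for key in REQUIRED_KEYS
        if not values.get(key, "").strip()
    ]
    # The data directory has to be given as an absolute path.
    data_dir = values.get("parcoure_data_dir", "").strip().strip('"')
    if data_dir and not os.path.isabs(data_dir):
        problems.append("parcoure_data_dir must be an absolute path")
    # The worker count should be a positive whole number.
    workers = values.get("worker_count", "").strip()
    if workers and not (workers.isdigit() and int(workers) > 0):
        problems.append("worker_count must be a positive integer")
    return problems
```

Pass the content of the file, for example check_config(open("config.ini").read()); an empty list means none of these checks failed.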

6. Prepare ParCourE

Run the prepare.sh script, giving the config file as its parameter. The script performs the following steps:

  • Convert the corpus to the format that ParCourE understands. Since ParCourE is creating the new corpus files at this stage, "file not found" warnings can be ignored.
  • Index the corpus with Elasticsearch
  • Create word level alignments
  • Precompute statistics
  • Precompute lexicon
bash ./prepare.sh

Optional step: check the elasticSearch/indexing.log file to see whether all files have been indexed correctly.

7. Run ParCourE

  • In execute.sh, set FLASK_SECRET_KEY to a secret string that is hard to guess
  • Run ParCourE
bash ./execute.sh
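One way to obtain a hard-to-guess value for FLASK_SECRET_KEY is Python's secrets module. This is only a sketch of one option; the function name is hypothetical, and any sufficiently long random string works as well.

```python
import secrets


def new_flask_secret_key(num_bytes=32):
    """Return a random hexadecimal string with 2 * num_bytes characters."""
    return secrets.token_hex(num_bytes)
```

Generate the key once, paste the printed value into execute.sh, and keep it out of version control.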

References

For more details see the paper:

@article{imani-etal-2021-parcoure,
    title = "{P}ar{C}our{E}: A Parallel Corpus Explorer for a Massively Multilingual Corpus",
    author = {Imani, Ayyoob and 
      Jalili Sabet, Masoud  and
      Dufter, Philipp  and
      Cysouw, Michael  and
      Sch{\"u}tze, Hinrich},
    year = "2021",
    note = "to be published"
}

Feedback

Feedback and contributions are more than welcome! Just reach out to @ayyoobimani, @masoudjs or @pdufter, or create an issue or pull request.

License

Copyright (C) 2020, Ayyoob Imani, Masoud Jalili Sabet, Philipp Dufter

A full copy of the license can be found in LICENSE.