ParCourE

This repository contains code for ParCourE, the Parallel Corpus Explorer. It is a web app for browsing a word-aligned multiparallel corpus. You can view one instance of ParCourE, running on a word-aligned version of the Parallel Bible Corpus by Mayer and Cysouw (2014), here.

Setup

This guide shows how to set ParCourE up for a parallel corpus. We will download a parallel corpus in XCES format, specifically a small version of the Bible corpus from OPUS, and set ParCourE up for it.

1. Environment

Using Anaconda, you can create an environment with the required dependencies using the following command: conda env create --file dependencies.yaml
Switch to the newly created environment: conda activate parcoure

If you don't use Anaconda, you will have to install the dependencies listed in the dependencies.yaml file in your environment of choice.

2. Download Corpus

Download the following files from the OPUS website and extract them. Alternatively, you can download corpora of your choice in the languages of your choice.

After extraction, put the language-specific data files and the inter-language alignment files in one directory. In this example we put the files English.xml English-WEB.xml Farsi.xml German.xml en-pes.xml de-en.xml de-pes.xml in a directory called CES_Corpus.
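Before moving on, it can help to confirm that the directory really contains everything. The following is a minimal sketch, not part of ParCourE: the helper name missing_corpus_files is hypothetical, and the file list is simply the one from this example.

```python
from pathlib import Path

# File names taken from the example above; adjust them for your own corpus.
EXPECTED_FILES = [
    "English.xml", "English-WEB.xml", "Farsi.xml", "German.xml",
    "en-pes.xml", "de-en.xml", "de-pes.xml",
]


def missing_corpus_files(corpus_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in corpus_dir."""
    root = Path(corpus_dir)
    return [name for name in expected if not (root / name).is_file()]


# "CES_Corpus" is the directory name used in this guide.
missing = missing_corpus_files("CES_Corpus")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All corpus files are in place.")
```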

3. Elasticsearch

Install Elasticsearch from here and start the server. Then add its address to the config file (see below). Elasticsearch uses port 9200 by default; if you change it, you also have to change it in the config file. Also make sure that Elasticsearch is accessible from the machine running ParCourE.

Check whether Elasticsearch is accessible (use your IP address instead of localhost if you have installed Elasticsearch on another server):

$> curl localhost:9200

{
  "name" : "hostName",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "-IGDqOQOSwWnVY-RVHSedg",
  "version" : {
    "number" : "7.9.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "dsdfsac4e49417f2da2f244e3e97b4e6e",
    "build_date" : "2019-08-11T00:44:31.62642",
    "build_snapshot" : false,
    "lucene_version" : "5.4.2",
    "minimum_wire_compatibility_version" : "11.8.0",
    "minimum_index_compatibility_version" : "4.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
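The same check can be done from Python, for example inside the ParCourE environment. This is a sketch only: the function names are hypothetical, and it assumes Elasticsearch answers over plain HTTP without authentication, which may not hold for your installation.

```python
import json
from urllib.request import urlopen


def elasticsearch_version(body):
    """Extract the version number from the JSON text served at the root URL."""
    info = json.loads(body)
    return info["version"]["number"]


def check_elasticsearch(address="localhost:9200", timeout=5):
    """Return the server's version string; raises an error if it is unreachable."""
    # Assumption: plain HTTP, no authentication.
    with urlopen(f"http://{address}", timeout=timeout) as response:
        return elasticsearch_version(response.read().decode("utf-8"))
```

Calling check_elasticsearch("localhost:9200") with the address you intend to put in the config file should return a version string such as the one shown in the curl output above.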

4. Word Aligner

Install SimAlign (it is required for the input alignment page to work): pip install --upgrade git+https://github.com/cisnlp/simalign.git#egg=simalign

For aligning your parallel corpus you can use any word alignment tool you prefer. Here we used SimAlign for as many languages as possible and eflomal for the remaining languages.

Some popular word aligners are:

Currently, to use aligners other than fast_align and eflomal, you should extract word alignments manually and put them in alignments_dir.
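If you extract alignments manually, you will need to write them to disk in some form. Tools such as fast_align and eflomal emit one line per sentence pair made of "i-j" index pairs (often called Pharaoh format). The sketch below converts between index pairs and such lines; the function names are hypothetical, and the exact file layout ParCourE expects in alignments_dir is not described in this guide, so check the repository before relying on it.

```python
def to_pharaoh(pairs):
    """Serialize (source_index, target_index) pairs as 'i-j' tokens on one line."""
    return " ".join(f"{s}-{t}" for s, t in sorted(pairs))


def from_pharaoh(line):
    """Parse a line such as '0-0 1-2' back into a list of index pairs."""
    return [tuple(int(i) for i in token.split("-")) for token in line.split()]


# Hypothetical alignment for one sentence pair.
line = to_pharaoh([(1, 2), (0, 0), (2, 1)])
print(line)
print(from_pharaoh(line))
```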

5. Configurations

Set the following in the file config.ini:

ces_corpus_dir: A directory containing the downloaded corpus files in CES format. The toy example will include the following files, from the extracted files above: de-en.xml de-pes.xml en-pes.xml English-WEB.xml English.xml Farsi.xml German.xml
ces_alignment_files: A comma-separated list of the files that contain sentence alignments. In our case it is de-pes.xml,de-en.xml,en-pes.xml
parcoure_data_dir: An absolute directory path where ParCourE can keep its data and configuration files.
elasticsearch_address: IP and port of Elasticsearch.
fast_align_path: Something like "/my_installation_path/fast_align/build/". If you set extra_aligner_path ParCourE will use it for word alignment, otherwise it will use fast_align by default.
extra_aligner_path: (optional) Something like "/my_installation_path/eflomal/". If you don't set it, ParCourE will use fast_align to extract word alignments.
worker_count: Increasing this number lets ParCourE use more CPU cores, which speeds up word alignment extraction during setup.
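A few of these settings are easy to get wrong, so a quick sanity check before running the setup can save time. The following sketch uses Python's configparser; the helper names are hypothetical, and because the section names inside config.ini are not documented here, it simply merges the keys of all sections.

```python
import configparser
import os


def load_settings(text):
    """Parse INI text and merge the keys of all sections into one dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    settings = {}
    for section in parser.sections():
        settings.update(parser[section])
    return settings


def check_settings(settings):
    """Return a list of human-readable problems found in the settings."""
    problems = []
    # Strip optional surrounding quotes before testing the path.
    data_dir = settings.get("parcoure_data_dir", "").strip("\"'")
    if not os.path.isabs(data_dir):
        problems.append("parcoure_data_dir must be an absolute path")
    files = [f.strip() for f in settings.get("ces_alignment_files", "").split(",")]
    if not all(files):
        problems.append("ces_alignment_files must be a comma-separated list of file names")
    if ":" not in settings.get("elasticsearch_address", ""):
        problems.append("elasticsearch_address should contain host and port")
    return problems
```

To check a real file, read it first, for example check_settings(load_settings(open("config.ini").read())); an empty list means none of these particular problems were found.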

6. Prepare ParCourE

Run the prepare.sh script, giving the config file as its parameter. The script performs the following steps:

Convert the corpus to the format that ParCourE understands. Since ParCourE is creating the new corpus files at this stage, "file not found" warnings can be ignored.
Index the corpus with Elasticsearch
Create word level alignments
Precompute statistics
Precompute lexicon

bash ./prepare.sh

Optional step: check the elasticSearch/indexing.log file to see whether all files have been indexed correctly.

7. Run ParCourE

Set FLASK_SECRET_KEY, a hard-to-guess secret string, in execute.sh
Run ParCourE

bash ./execute.sh

Check out the result at http://localhost:8000/
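For the FLASK_SECRET_KEY mentioned above, one way to obtain a hard-to-guess string is Python's standard secrets module. This is only a suggestion; the helper name is hypothetical and any sufficiently random string will do.

```python
import secrets


def make_secret_key(num_bytes=32):
    """Return a random hex string built from num_bytes random bytes."""
    return secrets.token_hex(num_bytes)


# Print a fresh key that can be pasted into execute.sh.
print(make_secret_key())
```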

References

For more details see the paper:

@article{imani-etal-2021-parcoure,
    title = "{P}ar{C}our{E}: A Parallel Corpus Explorer for a Massively Multilingual Corpus",
    author = {Imani, Ayyoob and 
      Jalili Sabet, Masoud  and
      Dufter, Philipp  and
      Cysouw, Michael  and
      Sch{\"u}tze, Hinrich},
    year = "2021",
    note = "to be published"
}

Feedback

Feedback and contributions are more than welcome! Just reach out to @ayyoobimani, @masoudjs, or @pdufter, or create an issue or pull request.

License

A full copy of the license can be found in LICENSE.
