Skip to content

deep-bgt/Deep-BGT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Verbal Multiword Expression (VMWE) Identification

This repository contains an implementation of the Bidirectional LSTM-CRF model for VMWE identification and a novel tagging scheme bigappy-unicrossy for representation of overlaps in sequence labeling tasks.

PARSEME shared task on automatic identification of verbal MWEs - edition 1.1

The PARSEME shared task 2018 on automatic identification of verbal MWEs (VMWE) covers 20 languages. The corpora provided are in cupt format and include annotations of VMWEs consisting of categories. The categories of VMWEs are light verb constructions with two subcategories (LVC.full and LVC.cause), verbal idioms (VID), inherently reflexive verbs (IRV), verb-particle constructions with two subcategories (VPC.full and VPC.semi), multi-verb constructions (MVC), inherently adpositional verbs (IAV) and inherently clitic verbs (LS.ICV).

The corpora has been published at LINDAT/CLARIN.

We are the annotators of Turkish corpus. The paper describing our annotation process is Turkish Verbal Multiword Expressions Corpus.

Bidirectional LSTM-CRF Model for Verbal Multiword Expression Identification

The Deep-BGT system participated to the PARSEME shared task 2018. Our system is language-independent and uses the bidirectional Long Short-Term Memory model with a Conditional Random Field layer on top (bidirectional LSTM-CRF).

The architecture is decribed in our paper: Deep-BGT at PARSEME Shared Task 2018: Bidirectional LSTM-CRF Model for Verbal Multiword Expression Identification

To the best of our knowledge, this paper is the first one that employs the bidirectional LSTM-CRF model for VMWE identification.

The gappy 1-level tagging scheme is used. Our system was evaluated on 10 languages in the open track and it was ranked the second in terms of the general ranking metric. (Shared Task Results)

Later on, we evaluated our system on 19 languages. The results will be published in Springer's Lecture Notes in Computer Science (LNCS), CICLing 2019.

A Novel Tagging Scheme: bigappy-unicrossy

We introduce a new tagging scheme called bigappy-unicrossy to rise to the challenge of overlapping MWEs.

The bigappy-unicrossy tagging scheme is compared with the IOB2 and the gappy 1-level tagging schemes in VMWE identification task (PARSEME shared task) using our previous system Deep-BGT. It is evaluated our system on 19 languages. The results will be published in Springer's Lecture Notes in Computer Science (LNCS), CICLing 2019.

Citation

If you make use of our implementation regarding the deep learning architecture, please cite the following paper: Deep-BGT at PARSEME Shared Task 2018: Bidirectional LSTM-CRF Model for Verbal Multiword Expression Identification

If you make use of our implementation regarding the tagging scheme, please cite the following paper: Representing Overlaps in Sequence Labeling Tasks with a Novel Tagging Scheme: bigappy-unicrossy (will be published in Springer's Lecture Notes in Computer Science (LNCS), CICLing 2019).

Implementation

Requirements

  • Python 3.6

  • Keras 2.2.4 with Tensorflow 1.12.0, and keras-contrib==2.0.8

  • We cannot guarantee that the code works with different versions for Keras / Tensorflow.

  • We cannot provide the data used in the experiments in this code repository, because we have no right to distribute the corpora provided by PARSEME Shared Task Edition 1.1 .

     1. Please download corpora from https://gitlab.com/parseme/sharedtask-data
        Unzip the downloaded file
        Locate it into input/corpora
     2. All word embeddings are available in the https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md
        Download word embeddings and locate them into input/embeddings
     3. Language codes are Bulgarian (BG), German (DE), Greek (EL), English (EN), Spanish (ES), Basque (EU), Farsi (FA), French (FR),
        Hebrew (HE), Hindu (HI), Crotian (HR), Hungarian (HU), Italian (IT), Lithuanian (LT),
         Polish (PL), Portuguese (PT), Romanian (RO), Slovenian (SL), and Turkish (TR).
    

Setup with virtual environment (Python 3):

  • python3 -m venv my_venv

    source my_venv/bin/activate

  • Install the requirements: pip3 install -r requirements.txt

If everything works well, you can run the example usage described below.

Example Usage:

  • The following guide show an example usage of the model for English with bigappy-unicrossy tagging scheme.

  • Instructions

    1. Change directory to the location of the source code
    2. Run the instructions in "Setup with virtual environment (Python 3)"
    3. Run the command to train the model: python3 src/Runner.py -l EN -t gappy-crossy
       If you want to try the model with another configuration, change language code after -l, and tag after -t
       Languages: BG, DE, EL, EN, ES, EU, FA, FR, HE, HI HR, HU, LT, IT, PL, PT, RO, SL, TR
       Tags: IOB, gappy-1, gappy-crossy