A substantial amount of valuable knowledge is recorded as unstructured text, such as news, emails, and journal articles. The task of identifying semantic relations between entities mentioned in text is referred to as relation extraction. The goal of this project was to build a system that extracts information templates of the following forms:
- BORN (Person/Organization, Date, Location)
- ACQUIRE (Organization, Organization, Date)
- PART_OF
  - PART_OF (Organization, Organization)
  - PART_OF (Location, Location)
To build this system, we were provided with 30 text files containing Wikipedia articles that are split as follows:
- 10 articles related to Organizations
- 10 articles related to Persons
- 10 articles related to Locations
The general architecture of the information extraction pipeline is as follows:
For more details about the approach and its implementation, read this.
Here are a few example sentences and the relation templates extracted from them:
1. BORN
   * Abraham Lincoln was born on February 12, 1809, as the second child of Thomas and Nancy Hanks Lincoln, in a one-room log cabin on Sinking Spring Farm near Hodgenville, Kentucky.
     - Argument-1: Abraham Lincoln
     - Argument-2: February 12, 1809
     - Argument-3: Hodgenville
   * In May 2002, Musk founded SpaceX, an aerospace manufacturer and space transport services company, of which he is CEO and lead designer.
     - Argument-1: SpaceX
     - Argument-2: May 2002
2. ACQUIRE
   * Compaq acquired Zip2 for US$307 million in cash and US$34 million in stock options in February 1999.
     - Argument-1: Compaq
     - Argument-2: Zip2
     - Argument-3: February 1999
   * In 2015, Tesla acquired Riviera Tool & Die (with 100 employees in Michigan), one of its suppliers of stamping items.
     - Argument-1: Tesla
     - Argument-2: Riviera Tool & Die
     - Argument-3: 2015
3. PART_OF
   * They met in Springfield, Illinois in December 1839 and were engaged a year later.
     - Argument-1: Springfield
     - Argument-2: Illinois
   * The Mahatma Gandhi District in Houston, Texas, United States, an ethnic Indian enclave, is officially named after Gandhi.
     - Argument-1: Houston
     - Argument-2: Texas
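To make the templates above concrete, here is a minimal sketch of how a BORN template could be filled using spaCy's named entity recognizer together with a simple trigger-word check. This is not the chomskIE implementation: the `en_core_web_sm` model name and the `born_template` helper are illustrative assumptions.

```python
# Illustrative sketch only -- not the chomskIE pipeline.
# Fills BORN(Person/Organization, Date, Location) from a single sentence
# using spaCy NER plus a verb-lemma trigger check.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; any English pipeline with NER works

def born_template(sentence):
    """Return (Person/Org, Date, Location) for a birth/founding sentence, else None."""
    doc = nlp(sentence)
    # Trigger lemmas: "born" lemmatizes to "bear", "founded" to "found".
    if not any(tok.lemma_ in {"bear", "found"} for tok in doc):
        return None
    arg1 = next((e.text for e in doc.ents if e.label_ in {"PERSON", "ORG"}), None)
    arg2 = next((e.text for e in doc.ents if e.label_ == "DATE"), None)
    arg3 = next((e.text for e in doc.ents if e.label_ == "GPE"), None)
    return arg1, arg2, arg3

print(born_template(
    "Abraham Lincoln was born on February 12, 1809, near Hodgenville, Kentucky."
))
# Expected to print something close to ('Abraham Lincoln', 'February 12, 1809', 'Hodgenville')
```

A real extractor has to do more than look up entities, for example attaching the correct date and location to the correct entity when a sentence mentions several, so treat this purely as intuition for what the templates capture.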
This section describes the prerequisites and contains instructions to get the project up and running.
This project can easily be set up with all the prerequisite packages by following these instructions:
- Install Conda using the `conda_install.sh` file, with the command:
  `$ bash conda_install.sh`
- Create a conda environment from the included `environment.yml` file using the following command:
  `$ conda env create -f environment.yml`
- Activate the environment:
  `$ conda activate chomskIE`
- To install the package with setuptools extras, use the following command in the top-level directory containing the `setup.py` file:
  `$ pip install .`
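Depending on the spaCy pipeline the project is configured to use, you may also need to download a pretrained English model; the model name below is an example, not a stated project requirement:
`$ python -m spacy download en_core_web_sm`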
The user can get a description of the options by using the command:
> python __main__.py --help
To run the relation extraction pipeline on a batch of documents:
> python __main__.py --input_path <path-to-input-dir> --output_path <path-to-output-dir>
To run the relation extraction pipeline on a single document:
> python __main__.py --input_path <path-to-input-dir> --output_path <path-to-output-dir> --transform
The main dependencies are:
- python >= 3.7
- spaCy
- textaCy == 0.11.0
There are no strict guidelines for contributing, apart from a few general conventions we try to follow:
- Code should follow the PEP 8 style guide as closely as possible.
- Python modules in this project are documented with Google-style docstrings.
If you see something that could be improved, send a pull request!
We are always happy to look at improvements, to ensure that chomskIE, as a project, is the best version of itself.
If you think something should be done differently (or is just-plain-broken), please create an issue.
See the LICENSE file for more details.