This repository provides the source code and data of our COLING 2022 paper "Text-to-Text Extraction and Verbalization of Biomedical Event Graphs".
In bioinformatics, events are complex interactions mentioned in the scientific literature, involving a set of entities (e.g., proteins, genes, diseases, drugs), each contributing with a specific semantic role (e.g., theme, cause, site). Biomedical events include, for instance, molecular reactions, organism-level outcomes, and adverse drug reactions. Text-to-event (event extraction, EE) and event-to-text (event graph verbalization, EGV) systems bridge natural language and symbolic representations, taking a step towards decoupling concept units (what to say) from language competencies (how to say it). Almost all prior work on events revolves around semantic parsing, usually employing discriminative architectures and cumbersome multi-step pipelines limited to a small number of target interaction types. Although less explored, EGV also holds considerable potential, targeting the generation of informative text constrained on semantic graphs, which is crucial in applications such as conversational agents and summarization systems. We present the first lightweight framework to solve both event extraction and event verbalization with a unified text-to-text approach, allowing us to fuse all the resources so far designed for different tasks. To this end, we present a new event graph linearization technique and release highly comprehensive event-text paired datasets (BioT2E and BioE2T), covering more than 150 event types from multiple biology subareas (English language). By casting both parsing and generation as translation, we report baseline transformer model results on multiple biomedical text mining benchmarks and natural language generation metrics. Our extractive models achieve state-of-the-art performance, surpassing single-task competitors, and show promising capabilities for the controlled generation of coherent natural language utterances from structured data.
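To give an intuition of the text-to-text formulation, the following minimal sketch flattens a toy event graph into a string that a sequence-to-sequence model could consume or produce. The bracketed notation and the example event are illustrative assumptions, not the exact linearization scheme defined in the paper and implemented in this repository.

```python
# Illustrative only: a toy event graph and a naive bracketed linearization.
# The actual linearization used to build BioT2E/BioE2T is described in the paper.

def linearize(event_id, events):
    """Recursively flatten an event (trigger type + role-argument pairs) into a string."""
    event = events[event_id]
    parts = [event["type"]]
    for role, arg in event["args"]:
        if arg in events:                       # nested event argument
            parts.append(f"{role}={linearize(arg, events)}")
        else:                                   # entity mention argument
            parts.append(f"{role}={arg}")
    return "[" + " | ".join(parts) + "]"

# Toy graph for: "IL-6 up-regulates the phosphorylation of STAT3"
events = {
    "E1": {"type": "Phosphorylation", "args": [("Theme", "STAT3")]},
    "E2": {"type": "Positive_regulation", "args": [("Theme", "E1"), ("Cause", "IL-6")]},
}

print(linearize("E2", events))
# -> [Positive_regulation | Theme=[Phosphorylation | Theme=STAT3] | Cause=IL-6]
```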
General
- Python (verified on 3.8)
- CUDA (verified on 11.1)

Python Packages
- See `docker/requirements.txt` (a quick environment sanity check is sketched below)
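After installing the packages listed in `docker/requirements.txt`, a quick sanity check of the environment can look like the following; it assumes a CUDA-enabled PyTorch build is part of the installed requirements.

```python
import sys
import torch

# Expecting Python 3.8.x and a CUDA 11.1 toolkit.
print("Python:", sys.version.split()[0])
print("Torch CUDA version:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
```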
Our BioE2T and BioT2E datasets are derived from 10 influential benchmarks originally designed for biomedical EE (BEE) and primarily released within BioNLP-ST competitions. For your convenience, we include these freely accessible benchmarks directly within the repository: `data/datasets/original_datasets.tar.gz`.
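If you prefer unpacking the archive from Python rather than the shell, a minimal snippet (the destination folder below is an arbitrary choice):

```python
import tarfile

# Extract the bundled BEE benchmarks next to the archive.
with tarfile.open("data/datasets/original_datasets.tar.gz", "r:gz") as tar:
    tar.extractall(path="data/datasets/original_datasets")
```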
Corpus | Domain | #Documents | Annotation Schema |
---|---|---|---|
Genia Event Corpus (GE08) | Transcription factors in human blood cells | 1,000 abstracts | 35 entity types, 35 event types |
Genia Event 2011 (GE11) | See GE08 | 1,210 abstracts, 14 full papers | 2 entity types, 9 event types, 2 modifiers |
Epigenetics and Post-translational Modification (EPI11) | Epigenetic change and common protein post-translational modifications | 1,200 abstracts | 2 entity types, 14 event types, 2 modifiers |
Infectious Diseases (ID11) | Two-component regulatory systems | 30 full papers | 5 entity types, 10 event types, 2 modifiers |
Multi-Level Event Extraction (MLEE) | Blood vessel development, from the subcellular to the whole-organism level | 262 abstracts | 16 entity types, 19 event types |
GENIA-MK | See GE08 | 1,000 abstracts | 35 entity types, 35 event types, 5 modifiers (+2 inferable) |
Genia Event 2013 (GE13) | See GE08 | 34 full papers | 2 entity types, 13 event types, 2 modifiers |
Cancer Genetics (CG13) | Cancer biology | 600 abstracts | 18 entity types, 40 event types, 2 modifiers |
Pathway Curation (PC13) | Reactions, pathways, and curation | 525 abstracts | 4 entity types, 23 event types, 2 modifiers |
Gene Regulation Ontology (GRO13) | Human gene regulation and transcription | 300 abstracts | 174 entity types, 126 event types |
We publicly release our BioT2E (`data/datasets/biot2e`) and BioE2T (`data/datasets/bioe2t`) text-to-text datasets for event extraction and event graph verbalization, respectively. For replicability, we also provide the preprocessing, filtering, and sampling scripts (`notebooks/create_datasets.ipynb`) used for their automatic generation, mostly from EE datasets following a .txt/.a1/.a2 or .ann structure.
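For readers unfamiliar with this standoff format, here is a minimal, illustrative sketch of how a BioNLP-ST .a1/.a2 pair can be read into entities and events. The full preprocessing lives in `notebooks/create_datasets.ipynb`; this sketch ignores modifier, relation, and equivalence lines, and the document id and folder in the usage comment are hypothetical.

```python
from pathlib import Path

def read_standoff(doc_id, folder):
    """Read BioNLP-ST standoff files: .a1 holds entities, .a2 holds triggers and events."""
    entities, events = {}, {}
    for suffix in (".a1", ".a2"):
        path = Path(folder) / f"{doc_id}{suffix}"
        if not path.exists():
            continue
        for line in path.read_text(encoding="utf-8").splitlines():
            fields = line.split("\t")
            if line.startswith("T"):       # text-bound annotation: id, "Type start end", surface text
                ann_id, type_span, text = fields
                entities[ann_id] = {"type": type_span.split()[0], "text": text}
            elif line.startswith("E"):     # event: id, "Type:Trigger Role1:Arg1 Role2:Arg2 ..."
                ann_id, structure = fields[0], fields[1]
                trigger, *args = structure.split()
                events[ann_id] = {
                    "type": trigger.split(":")[0],
                    "trigger": trigger.split(":")[1],
                    "args": [tuple(a.split(":", 1)) for a in args],
                }
            # Other line types (M*, A*, R*, *Equiv) are skipped in this sketch.
    return entities, events

# Example usage (hypothetical paths):
# entities, events = read_standoff("PMID-10022882", "data/datasets/original_datasets/GE11/train")
```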
We trained and evaluated T5 and BART models.
- We reimplemented T5-Base (∼220M parameters, 12 layers, 768 hidden size, 12 heads) in Flax (T5X), starting from the Google Research codebase; see https://github.com/disi-unibo-nlp/bio-ee-egv/blob/main/src/utils/t5x.
- We built our BART-Base (∼139M parameters, 12 layers, 768 hidden size, 16 heads) model in PyTorch using the Hugging Face Transformers library (a small inference sketch follows).
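As an illustration of how a fine-tuned BART checkpoint can be queried for EE or EGV through the Transformers API; the checkpoint path and input sentence below are placeholders, and the training/test scripts under `src` define the actual setup.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder: point this to a BART-Base checkpoint fine-tuned on BioT2E (EE) or BioE2T (EGV).
checkpoint = "path/to/fine-tuned-bart-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

text = "IL-6 rapidly induces phosphorylation of STAT3 in hepatocytes."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_length=256, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```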
- Generate prediction files using the following scripts (a standalone ROUGE scoring sketch follows the lists below).
- Check the evaluation notebook (`./notebooks/evaluate_ee.ipynb`) to run the automatic evaluation.
T5 (T5X)
- EE → `python3 ./src/test_scripts/t5x/test_ee_t5.py`
- EGV → `python3 ./src/test_scripts/t5x/test_egv_t5.py`
- PubMed Summ → `python3 ./src/test_scripts/t5x/test_summarization_t5.py`
- Multi-task Learning (EE + EGV + PubMed Summ) → `python3 ./src/test_scripts/t5x/test_mtl_t5.py`
BART
- EE → `python3 ./src/test_scripts/bart/test_ee_bart.py`
- EGV → `python3 ./src/test_scripts/bart/test_egv_bart.py`
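If you want to score EGV predictions outside the notebook, a minimal sketch with the `rouge-score` package (this may differ from the exact evaluation configuration used in the repository; the reference/prediction pair is a toy example):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

# Toy pair; in practice, iterate over the generated prediction files.
reference = "IL-6 positively regulates the phosphorylation of STAT3."
prediction = "Phosphorylation of STAT3 is up-regulated by IL-6."

scores = scorer.score(reference, prediction)
for name, score in scores.items():
    print(name, round(score.fmeasure * 100, 2))
print("avg F1:", round(sum(s.fmeasure for s in scores.values()) / len(scores) * 100, 2))
```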
EE | Trained model | Val F1 (%), AVG on 10 benchmarks |
---|---|---|
T5[BioT2E] | [link] | 80.25 |
BART[BioT2E] | [link] | 73.50 |

EGV | Trained model | Val ROUGE-1/2/L F1 AVG (%) |
---|---|---|
T5[BioE2T] | [link] | 65.40 |
BART[BioE2T] | [link] | 54.30 |
- Giacomo Frisoni, giacomo.frisoni[at]unibo.it ♣
- Gianluca Moro, gianluca.moro[at]unibo.it
- Lorenzo Balzani, balzanilo[at]icloud.com ♣
♣ = Maintainers
If you run into problems or have suggestions or ideas, the Discussions board may already contain relevant information. If not, feel free to post your questions there 💬🗨.
This project is released under the CC-BY-NC-SA 4.0 license (see `LICENSE`).
If you use the reported code, datasets, or models in your research, please cite:
@inproceedings{frisoni-etal-2022-text,
title = "Text-to-Text Extraction and Verbalization of Biomedical Event Graphs",
author = "Frisoni, Giacomo and
Moro, Gianluca and
Balzani, Lorenzo",
booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/2022.coling-1.238",
pages = "2692--2710"
}