Here you can find an overview of the biomedical NER data sets integrated in HunFlair.
Content: Overview | HUNER Data Sets | BioBERT Evaluation Splits
HunFlair integrates 31 biomedical named entity recognition (NER) data sets and provides
them in a unified format to foster the development and evaluation of new NER models. All
data set implementations can be found in flair.datasets.biomedical.
Corpus | Data Set Class | Entity Types | Reference |
---|---|---|---|
AnatEM | ANAT_EM | Anatomical entities | Paper, Website |
Arizona Disease | AZDZ | Disease | Website |
BioCreative II GM | BC2GM | Gene | Paper |
BioCreative V CDR task | CDR | Chemical, Disease | Paper, Website |
BioInfer | BIO_INFER | Gene/Protein | Paper |
BioNLP'2013 Cancer Genetics (ST) | BIONLP2013_CG | Chemical, Disease, Gene/Protein, Species | Paper |
BioNLP'2013 Pathway Curation (ST) | BIONLP2013_PC | Chemical, Gene/Protein | Paper |
BioSemantics* | BIOSEMANTICS | Chemical, Disease | Paper, Website |
CellFinder | CELL_FINDER | Cell line, Gene, Species | Paper |
CEMP | CEMP | Chemical | Website |
CHEBI | CHEBI | Chemical, Gene, Species | Paper |
CHEMDNER | CHEMDNER | Chemical | Paper |
CLL | CLL | Cell line | Paper |
DECA | DECA | Gene | Paper |
FSU | FSU | Gene | Paper |
GPRO | GPRO | Gene | Website |
CRAFT (v2.0) | CRAFT | Chemical, Gene, Species | Paper |
CRAFT (v4.0.1) | CRAFT_V4 | Chemical, Gene, Species | Website |
GELLUS | GELLUS | Cell line | Paper |
IEPA | IEPA | Gene | Paper |
JNLPBA | JNLPBA | Cell line, Gene | Paper |
LINNEAUS | LINNEAUS | Species | Paper |
LocText | LOCTEXT | Gene, Species | Paper |
miRNA | MIRNA | Disease, Gene, Species | Paper |
NCBI Disease | NCBI_DISEASE | Disease | Paper |
Osiris v1.2 | OSIRIS | Gene | Paper |
Plant-Disease-Relations | PDR | Disease | Paper, Website |
S800 | S800 | Species | Paper |
SCAI Chemicals | SCAI_CHEMICALS | Chemical | Paper |
SCAI Disease | SCAI_DISEASE | Disease | Paper |
Variome | VARIOME | Gene, Disease, Species | Paper |
* The corpus is currently not available, but will be re-published online soon.
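
As a minimal sketch of how a data set class from the table might be used, the helper below maps a corpus name to its class name and instantiates that class from flair.datasets.biomedical. The dictionary CORPUS_CLASSES and the function load_corpus are hypothetical names introduced for illustration only. The sketch assumes that each class can be constructed without arguments and that doing so downloads and caches the data on first use.

```python
import importlib

# A few data set class names copied from the overview table, keyed by corpus name.
# (Hypothetical helper mapping, not part of flair.)
CORPUS_CLASSES = {
    "NCBI Disease": "NCBI_DISEASE",
    "BioCreative II GM": "BC2GM",
    "S800": "S800",
}


def load_corpus(corpus_name: str):
    """Instantiate the data set class listed for `corpus_name`."""
    class_name = CORPUS_CLASSES[corpus_name]  # raises KeyError for unknown names
    # Imported lazily, so flair is only required once a corpus is actually loaded.
    biomedical = importlib.import_module("flair.datasets.biomedical")
    # Assumption: the class takes no required arguments and fetches its data itself.
    return getattr(biomedical, class_name)()


# Example (requires flair and network access):
# corpus = load_corpus("NCBI Disease")
# print(corpus)
```

Resolving the class by name keeps the sketch importable without flair installed; in everyday use, importing the class directly from flair.datasets.biomedical is likely the simpler choice.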
In addition to the individual data sets, HunFlair provides the fixed splits used by HUNER (Weber et al.) to make evaluations more comparable. For each entity type there is one combined data set class as well as one class per contained data set.
Entity Type | Data Set Class | Contained Data Sets |
---|---|---|
Cell Line | HUNER_CELL_LINE | HUNER_CELL_LINE_CELL_FINDER, HUNER_CELL_LINE_CLL, HUNER_CELL_LINE_GELLUS, HUNER_CELL_LINE_JNLPBA |
Chemical | HUNER_CHEMICAL | HUNER_CHEMICAL_CDR, HUNER_CHEMICAL_CEMP, HUNER_CHEMICAL_CHEBI, HUNER_CHEMICAL_CHEMDNER, HUNER_CHEMICAL_CRAFT_V4, HUNER_CHEMICAL_SCAI |
Disease | HUNER_DISEASE | HUNER_DISEASE_CDR, HUNER_DISEASE_MIRNA, HUNER_DISEASE_NCBI, HUNER_DISEASE_SCAI, HUNER_DISEASE_VARIOME |
Gene/Protein | HUNER_GENE | HUNER_GENE_BC2GM, HUNER_GENE_BIO_INFER, HUNER_GENE_CELL_FINDER, HUNER_GENE_CHEBI, HUNER_GENE_CRAFT_V4, HUNER_GENE_DECA, HUNER_GENE_FSU, HUNER_GENE_GPRO, HUNER_GENE_IEPA, HUNER_GENE_JNLPBA, HUNER_GENE_LOCTEXT, HUNER_GENE_MIRNA, HUNER_GENE_OSIRIS, HUNER_GENE_VARIOME |
Species | HUNER_SPECIES | HUNER_SPECIES_CELL_FINDER, HUNER_SPECIES_CHEBI, HUNER_SPECIES_CRAFT_V4, HUNER_SPECIES_LINNEAUS, HUNER_SPECIES_LOCTEXT, HUNER_SPECIES_MIRNA, HUNER_SPECIES_S800, HUNER_SPECIES_VARIOME |
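
To illustrate how the fixed splits might be inspected, the following sketch loads the combined HUNER class for an entity type and reports the size of each split. HUNER_CLASSES and huner_split_sizes are hypothetical names used only here. The sketch assumes the combined classes can be constructed without arguments and expose train, dev and test splits like other flair corpora.

```python
import importlib

# Combined HUNER data set classes from the table, keyed by entity type.
# (Hypothetical helper mapping, not part of flair.)
HUNER_CLASSES = {
    "Cell Line": "HUNER_CELL_LINE",
    "Chemical": "HUNER_CHEMICAL",
    "Disease": "HUNER_DISEASE",
    "Gene/Protein": "HUNER_GENE",
    "Species": "HUNER_SPECIES",
}


def huner_split_sizes(entity_type: str) -> dict:
    """Load the combined HUNER corpus for `entity_type` and count its sentences per split."""
    class_name = HUNER_CLASSES[entity_type]  # raises KeyError for unknown entity types
    # Imported lazily, so flair is only required when a corpus is actually loaded.
    biomedical = importlib.import_module("flair.datasets.biomedical")
    # Assumption: no constructor arguments are required; data is downloaded on first use.
    corpus = getattr(biomedical, class_name)()
    return {
        "train": len(corpus.train),
        "dev": len(corpus.dev),
        "test": len(corpus.test),
    }


# Example (requires flair and network access):
# print(huner_split_sizes("Cell Line"))
```

Because the splits are fixed, the reported sizes should stay the same across runs, which is what makes results on these corpora comparable.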
To ease comparison with BioBERT, HunFlair provides the splits used by
Lee et al.: BIOBERT_CHEMICAL_BC4CHEMD, BIOBERT_GENE_BC2GM, BIOBERT_GENE_JNLPBA, BIOBERT_CHEMICAL_BC5CDR,
BIOBERT_DISEASE_BC5CDR, BIOBERT_DISEASE_NCBI, BIOBERT_SPECIES_LINNAEUS, and BIOBERT_SPECIES_S800.
Note: To download and use the BioBERT corpora, you need to install the package googledrivedownloader, because the files are hosted on Google Drive:
pip install googledrivedownloader
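
A hedged sketch of how this dependency could be checked before loading a BioBERT corpus is shown below. The names GDRIVE_MODULES, is_any_installed and load_biobert_corpus are hypothetical. The import name of the googledrivedownloader package is an assumption and may differ between releases, so the sketch checks two candidate module names. It also assumes the BioBERT classes live in flair.datasets.biomedical and take no required arguments.

```python
import importlib
import importlib.util

# Candidate import names for the googledrivedownloader package (assumption:
# the name depends on the installed release, so both spellings are checked).
GDRIVE_MODULES = ("google_drive_downloader", "googledrivedownloader")


def is_any_installed(module_names) -> bool:
    """Return True if at least one of the given top-level modules can be found."""
    return any(importlib.util.find_spec(name) is not None for name in module_names)


def load_biobert_corpus(class_name: str):
    """Instantiate a BioBERT split class such as "BIOBERT_DISEASE_NCBI"."""
    if not is_any_installed(GDRIVE_MODULES):
        raise RuntimeError(
            "Missing dependency, run: pip install googledrivedownloader"
        )
    # Imported lazily, so flair is only required when a corpus is actually loaded.
    biomedical = importlib.import_module("flair.datasets.biomedical")
    return getattr(biomedical, class_name)()


# Example (requires flair, googledrivedownloader and network access):
# corpus = load_biobert_corpus("BIOBERT_DISEASE_NCBI")
```

Checking up front turns a failure in the middle of a download into a clear message about the missing package.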