RareCrowds

Build Status License: GPL v3

Package to serve public data from rare disease patients as found in publications and public resources. Most of the cases collected here have only phenotypic data, given as a list of HPO terms. The package offers 5 core modules:

  • DiseaseAnnotations: Disease information.
  • HPO: Symptom analysis through HPO.
  • PatientSampler: Functionality to sample simulated patients based on the disease annotations and HPO.
  • PhenotypicComparison: Functionality to plot phenotypic comparisons between two phenotypic profiles.
  • PhenotypicDatabase: Local database to push available data to and pull data from. Publicly available data will be persisted here.

The 5 modules are covered in the Usage section below.

This package is in early development, so do not expect extensive docstrings or Sphinx documentation. At this point, this README is your best resource. If you have any doubts, please create an Issue and we'll give you an answer ASAP.

If you are not a Python programmer but are interested in analyzing these data, and perhaps in trying to create a disease prediction algorithm, you will find the data in the resources directory. It contains all the nodes of the HPO ontology, the edges between them and a JSON file with the disease information.

Installation

To install it simply run: pip install rarecrowds

The PyPI project lives here: https://pypi.org/project/rarecrowds/.

Usage

DiseaseAnnotations

Disease information is extracted from Orphanet's orphadata (product 4, product 9 (prevalence) and product 9 (ages)) and from the HPOA file created by the Monarch Initiative within the HPO project. By default, the Orphanet and OMIM phenotypic descriptions of a rare disease, both extracted from the HPOA file, are intersected. In principle, you do not need to parse the data provided by these institutions yourself.

In order to get information from a particular disease, use the following lines:

from rarecrowds import DiseaseAnnotations
dann = DiseaseAnnotations(mode='intersect')
data = dann.data['ORPHA:324']

This will output the information available about Fabry disease, with Orphanet's ID ORPHA:324. In order to query the disease information, please use Orphanet IDs. For further reference, visit www.orpha.net.

The following is an extract of the data returned by the lines above:

data = {
    'ageDeath': ['adult'],
    'ageOnset': ['Childhood'],
    'group': 'Disorder',
    'inheritance': ['X-linked recessive'],
    'link': 'http://www.orpha.net/consor/cgi-bin/OC_Exp.php?lng=en&Expert=324',
    'name': 'Fabry disease',
    'phenotype': {   'HP:0000083': {   'frequency': 'HP:0040281',
                                       'modifier': {   'diagnosticCriteria': True}},
                     'HP:0000091': {   'frequency': 'HP:0040282',
                                       'modifier': {   'diagnosticCriteria': True}},
                     ## Many other symptoms here
                     'HP:0100820': {   'frequency': 'HP:0040283',
                                       'modifier': {   'diagnosticCriteria': True}}},
    'prevalence': [   {   'class': '1-9 / 1 000 000',
                          'geographic': 'Europe',
                          'meanPrev': '0.22',
                          'qualification': 'Value and class',
                          'source': 'ORPHANET',
                          'type': 'Prevalence at birth',
                          'validation': {'status': 'Not yet validated'}},
                      ## Other prevalence studies here
                      {   'class': '1-9 / 100 000',
                          'geographic': 'Sweden',
                          'meanPrev': '1.11',
                          'qualification': 'Value and class',
                          'source': '25274184[PMID]',
                          'type': 'Prevalence at birth',
                          'validation': {'status': 'Validated'}}],
    'source': {},
    'type': 'Disease',
    'validation': {'date': '2016-06-01 00:00:00.0', 'status': 'y'}
}
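
Once you have the `data` dictionary, the nested `phenotype` entry can be walked like any other dict. The sketch below groups a disease's HPO terms by their frequency term. It uses a small hand-written dictionary that mirrors the extract above instead of calling the package, and the helper name `terms_by_frequency` is made up for illustration.

```python
from collections import defaultdict

# Hand-written stand-in that mirrors the structure of dann.data['ORPHA:324'].
data = {
    'name': 'Fabry disease',
    'phenotype': {
        'HP:0000083': {'frequency': 'HP:0040281'},
        'HP:0000091': {'frequency': 'HP:0040282'},
        'HP:0100820': {'frequency': 'HP:0040283'},
    },
}

def terms_by_frequency(disease):
    """Group HPO term IDs by the frequency term attached to each of them."""
    grouped = defaultdict(list)
    for hpo_id, info in disease.get('phenotype', {}).items():
        # Terms without a frequency annotation are filed under None.
        grouped[info.get('frequency')].append(hpo_id)
    return dict(grouped)

by_freq = terms_by_frequency(data)
for freq, terms in by_freq.items():
    print(freq, terms)
```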

Based on this data, one may subset the diseases in order to get a list of diseases of interest, which is highly recommended at the beginning of the development of a phenotype-based prediction algorithm:

# Continues from the previous snippet (reuses `dann`)
# Note: `ann` is the same object as dann.data, so the deletions below modify it in place
ann = dann.data
print(f'# total initial entities: {len(ann)}')
## Keep only disorders
for dis, val in list(ann.items()):
    if val['group'] != 'Disorder':
        del ann[dis]
print(f'# diseases: {len(ann)}')
## Keep only those with phenotypic information
for dis, val in list(ann.items()):
    if not val.get('phenotype'):
        del ann[dis]
print(f'# diseases with phenotype data: {len(ann)}')
## Remove clinical syndromes
for dis, val in list(ann.items()):
    if val['type'].lower() == 'clinical syndrome':
        del ann[dis]
print(f'# diseases w/o clinical syndromes: {len(ann)}')
## Keep only selected prevalences
valid_prev = ['>1 / 1000', '6-9 / 10 000', '1-5 / 10 000', '1-9 / 100 000', 'Unknown', 'Not yet documented']
for dis, val in list(ann.items()):
    if 'prevalence' in val:
        classes = [a['class'] for a in val['prevalence'] if a['type'] == 'Point prevalence']
        if not any(x in valid_prev for x in classes):
            del ann[dis]
    else:
        del ann[dis]
print(f'# diseases with valid prevalence: {len(ann)}')

As a result, the number of entities in the disease annotations object should be reduced as follows:

# total initial entities: 6930
# diseases: 5745
# diseases with phenotype data: 3649
# diseases w/o clinical syndromes: 3628
# diseases with valid prevalence: 1288
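
Note that the loops above delete entries from `dann.data` itself. If you would rather keep the original annotations intact, the same filters can be written as a function that returns a new dictionary. The following is a sketch only: `filter_diseases` is a hypothetical helper, not part of RareCrowds, and the three toy entries (with made-up IDs) stand in for real annotations.

```python
VALID_PREV = {'>1 / 1000', '6-9 / 10 000', '1-5 / 10 000',
              '1-9 / 100 000', 'Unknown', 'Not yet documented'}

def filter_diseases(ann, valid_prev=VALID_PREV):
    """Return a new dict with only the entries that pass all four filters."""
    kept = {}
    for dis, val in ann.items():
        if val.get('group') != 'Disorder':
            continue  # keep only disorders
        if not val.get('phenotype'):
            continue  # keep only entries with phenotypic information
        if val.get('type', '').lower() == 'clinical syndrome':
            continue  # drop clinical syndromes
        classes = [p['class'] for p in val.get('prevalence', [])
                   if p.get('type') == 'Point prevalence']
        if not any(c in valid_prev for c in classes):
            continue  # keep only the selected point-prevalence classes
        kept[dis] = val
    return kept

# Toy annotations (made-up IDs) to show the behaviour.
toy = {
    'TOY:1': {'group': 'Disorder', 'type': 'Disease',
              'phenotype': {'HP:0000083': {}},
              'prevalence': [{'class': '1-9 / 100 000',
                              'type': 'Point prevalence'}]},
    'TOY:2': {'group': 'Group of disorders', 'type': 'Category',
              'phenotype': {'HP:0000083': {}},
              'prevalence': [{'class': 'Unknown',
                              'type': 'Point prevalence'}]},
    'TOY:3': {'group': 'Disorder', 'type': 'Disease',
              'phenotype': {'HP:0000091': {}},
              'prevalence': [{'class': '1-9 / 1 000 000',
                              'type': 'Prevalence at birth'}]},
}
selected = filter_diseases(toy)
print(sorted(selected))
```

Because the input is never modified, the function can be called several times with different `valid_prev` sets on the same annotations.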

HPO

Symptoms are organized through the Human Phenotype Ontology (HPO). If you are not familiar with it, please visit the website.

In order to get information on specific symptom IDs and other items included in the HPO ontology, such as the frequency subontology, RareCrowds includes the HPO module. This module allows you to get information about each term and their relationships.

In order to get information about a specific HPO term, run the following lines:

from rarecrowds import Hpo
hpo = Hpo()
hpo['HP:0001250'] ## Get information about 'seizures'

In order to see the successors or predecessors of a particular term, run any of the following lines:

hpo.successors(['HP:0001250'])
hpo.predecessors(['HP:0001250'])

In order to simplify a phenotypic profile, leaving only the most informative, yet independent, terms, run the following line:

hpo.simplify(['HP:0001250', 'HP:0007359'])

Available methods (apologies for the lack of documentation):

  • hpo.items(): Returns all items in HPO. Keep in mind that not all items are phenotypic abnormalities. If you want all symptoms, ask for ALL the successors of HP:0000118.
  • hpo.save_json(filename): Saves the ontology as JSON.
  • hpo.json(): Returns a JSON object of the ontology.
  • hpo.json_adjacency(): Dumps the adjacency matrix as JSON.
  • hpo.successors(ids, depth=1): Returns list of successors. If depth = 0 it returns immediate successors.
  • hpo.predecessors(ids, depth=1): Returns list of predecessors. If depth = 0 it returns immediate predecessors.
  • hpo.simplify(ids): Simplifies a phenotypic profile, leaving only the most informative terms.
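
To give an intuition for what simplifying a profile means, the sketch below drops every term that is an ancestor of another term in the same profile, so that only the most specific terms remain. It runs on a tiny made-up hierarchy stored as a child-to-parents dictionary. It is not the package's implementation, and `hpo.simplify` may apply additional rules.

```python
# Made-up hierarchy: each term maps to its direct parents.
PARENTS = {
    'T:root': [],
    'T:seizure': ['T:root'],
    'T:focal_seizure': ['T:seizure'],
    'T:renal': ['T:root'],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upwards."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def simplify(profile, parents=PARENTS):
    """Keep only terms that are not ancestors of another term in the profile."""
    redundant = set()
    for term in profile:
        redundant |= ancestors(term, parents)
    return [t for t in profile if t not in redundant]

simplified = simplify(['T:seizure', 'T:focal_seizure', 'T:renal'])
print(simplified)
```

Here `T:seizure` is removed because the more specific `T:focal_seizure` already implies it, while `T:renal` stays because it sits on an independent branch.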

PatientSampler

This module allows the creation of realistic patient profiles based on the disease annotations. A patient is sampled from a given disease in the following steps:

  1. Sample symptoms using the symptom frequency.
  2. From the selected symptoms, sample imprecision as a Poisson process with a certain probability of getting a less specific term using the HPO ontology.
  3. Add random noise sampling random HPO terms. The amount of random noise is also a Poisson process, while the selection of the HPO terms to include is uniform across the phenotypic abnormality subontology (disregarding too uninformative terms).
  4. Sample patient age by assuming that it is close to the disease onset plus a delay of ~1 month.
  5. Sample patient sex taking into account the inheritance pattern of the disease.
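
The first three steps can be sketched with the standard library alone. The code below is an illustration under several assumptions, not the package's sampler: the probability assigned to each frequency term is the midpoint of the range in its HPO definition, each term is assumed to climb a Poisson-distributed number of levels, the Poisson draws use Knuth's method, the hierarchy is a made-up child-to-parent map, and the function names are invented for this example. Age and sex sampling (steps 4 and 5) are left out.

```python
import math
import random

# Assumed probabilities: midpoints of the ranges in the HPO frequency terms.
FREQ_PROB = {
    'HP:0040280': 1.0,    # Obligate
    'HP:0040281': 0.895,  # Very frequent (80-99%)
    'HP:0040282': 0.545,  # Frequent (30-79%)
    'HP:0040283': 0.17,   # Occasional (5-29%)
    'HP:0040284': 0.025,  # Very rare (1-4%)
    'HP:0040285': 0.0,    # Excluded
}

# Made-up hierarchy (child -> parent) standing in for the HPO graph.
PARENT = {'T:a1': 'T:a', 'T:a': 'T:root', 'T:b1': 'T:b', 'T:b': 'T:root'}

def poisson(rng, lam):
    """Draw from a Poisson distribution with Knuth's multiplication method."""
    if lam <= 0:
        return 0
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_patient(disease_phenotype, imprecision=1, noise=0.25,
                   omit_frequency=False, rng=None):
    rng = rng or random.Random()
    # Step 1: keep each symptom with the probability implied by its frequency
    # (terms without a known frequency are always kept, by assumption).
    terms = [t for t, info in disease_phenotype.items()
             if omit_frequency
             or rng.random() < FREQ_PROB.get(info.get('frequency'), 1.0)]
    # Step 2 (assumed reading): move each term up a Poisson number of levels.
    blurred = []
    for term in terms:
        for _ in range(poisson(rng, imprecision)):
            parent = PARENT.get(term)
            if parent is None or parent == 'T:root':
                break  # never climb to the uninformative root
            term = parent
        blurred.append(term)
    # Step 3: add a Poisson number of terms drawn uniformly from the hierarchy.
    pool = sorted(set(PARENT) - {'T:root'})
    extra = [rng.choice(pool) for _ in range(poisson(rng, noise))]
    return sorted(set(blurred + extra))

disease = {'T:a1': {'frequency': 'HP:0040280'},
           'T:b1': {'frequency': 'HP:0040285'}}
# Without imprecision, noise or frequencies, the patient equals the disease.
ideal = sample_patient(disease, imprecision=0, noise=0, omit_frequency=True)
print(ideal)
```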

In order to sample 5 patients from a disease, run the following lines:

from rarecrowds import PatientSampler
sampler = PatientSampler()
patients = sampler.sample(patient_params="default", N=5)

As a result, an object similar to the following would be generated:

patients = {
    'ORPHA:324': {
        'id': 'ORPHA:324',
        'name': 'Fabry disease',
        'phenotype': {
            'HP:0000083': {'Frequency': 'HP:0040281'},
            ## Many other symptoms here
            'HP:0100820': {'Frequency': 'HP:0040283'}},
        'cohort': [ # As many items in the list as patients simulated
            {
                'ageOnset': None,
                'phenotype': {
                    'HP:0025031': {},
                    ## Other symptoms here
                    'HP:0100279': {}
                }
            }
        ]
    }
}

You can configure the imprecision and noise levels used to sample patient symptoms:

'''
These are the options for patient simulation parameters
 "default": {
    "imprecision": 1,
    "noise": 0.25,
    "omit_frequency": False,
},
"ideal": {
    "imprecision": 0,
    "noise": 0,
    "omit_frequency": True,
},  # For debugging. No noise. All patients = disease.
"freqs": {
    "imprecision": 0,
    "noise": 0,
    "omit_frequency": False,
},  # For simple cases without noise. All patients = disease*frequencies.
"impre": {
    "imprecision": 1,
    "noise": 0,
    "omit_frequency": False,
},  # Meant for patients without the most granular terms.
"impre2": {
    "imprecision": 2,
    "noise": 0,
    "omit_frequency": False,
}
'''

PhenotypicComparison

Comparing phenotypic profiles is often tricky. Venn diagrams are helpful, but often fall short in cases with complicated symptom relations. This module offers a detailed view of the overlap between, at most, 2 phenotypic profiles. It plots the HPO ontology graph with color-coded nodes that mark the common nodes and the nodes belonging to each profile. The plots use Plotly, so an interactivity-enabled viewer is recommended (most notebooks support this).

If a single phenotypic profile is passed as argument, it will plot the symptoms:

from rarecrowds import PhenotypicComparison
fig = PhenotypicComparison(patient = patients['ORPHA:324']['cohort'][0]['phenotype'])

If two phenotypic profiles are passed as argument, it will plot a comparison:

fig = PhenotypicComparison(
    patient = patients['ORPHA:324']['cohort'][0]['phenotype'],
    disease = { # This entry may also be a list of HPO terms.
        'name': patients['ORPHA:324']['name'],
        'id': patients['ORPHA:324']['id'],
        'phenotype': patients['ORPHA:324']['phenotype']})
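
The color groups in such a plot correspond to a simple partition of the two profiles. As a rough illustration, independent of the plotting code, the sketch below splits two sets of HPO IDs into shared, patient-only and disease-only terms. The helper `partition_profiles` and the way the example IDs are assigned to each side are assumptions made for this example; the module itself draws these groups on the ontology graph.

```python
def partition_profiles(patient, disease):
    """Split two collections of HPO IDs into common and exclusive terms."""
    patient, disease = set(patient), set(disease)
    return {
        'common': sorted(patient & disease),
        'patient_only': sorted(patient - disease),
        'disease_only': sorted(disease - patient),
    }

# Dicts work directly: iterating a phenotype dict yields its HPO IDs.
patient_profile = {'HP:0000083': {}, 'HP:0025031': {}}
disease_profile = {'HP:0000083': {'Frequency': 'HP:0040281'},
                   'HP:0100820': {'Frequency': 'HP:0040283'}}
overlap = partition_profiles(patient_profile, disease_profile)
print(overlap)
```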

PhenotypicDatabase

Finally, you may use the PhenotypicDatabase module to pull data from public sources. Currently, these are the supported sources:

| Publication | Edited | Source | N. cases |
|---|---|---|---|
| Stavropoulos, 2016 | No | Rao, 2018 | 28 |
| Bone, 2016 | No | Rao, 2018 | 3 |
| Stelzer, 2016 | No | Rao, 2018 | 2 |
| Lee, 2014 | No | Rao, 2018 | 200 |
| Kleyner, 2016 | Yes | Kleyner, 2016 | 1 |
| Zemojtel, 2014 | Added disease ID | Supp. | 11 |
| Cipriani, 2020 | Added disease ID | Supp. | 134 |
| Tomar, 2019 | Added disease ID | Supp. | 50 |
| Ebiki, 2019 | No | Supp. | 20 |
| ClinVar | Subsampled | ClinVar | 68153 |
| Robinson (Multiple publications) | No | Robinson | 384 |

Any publication or algorithm stemming from data from the sources above MUST cite the source properly. The onus is on the publisher to comply with this.

To get an instance of the PhenotypicDatabase:

from rarecrowds import PhenotypicDatabase
db = PhenotypicDatabase()

The PhenotypicDatabase instance manages your local database. You may add data to it by downloading available data or by generating it locally (via simulations or a local push). Available datasets are not in your local database until you explicitly download them. To check which datasets are available and load them for later use, run:

datasets = db.get_available_datasets()
db.load('some_dataset')

In order to dump data from your database, you can get either a pandas dataframe or a list of dictionaries. To get a dataframe of the data in the database:

df = db.generate_dataframe()

To get a list of dictionaries of the data in the database:

data = db.generate_list_of_dicts()

Interesting publications

Relevant publications for disease prediction based on phenotypes

There are many publications exploring the prediction of a particular rare disease based on a patient's phenotype. The phenotype analysis piece, which may or may not be the central aspect of a publication, largely falls into one of two categories: ontology-based or representation-based algorithms. The ontology-based algorithms define a logic by which distances between terms are calculated based on their position within the ontology and on how common each term is across the rare diseases (via the information content: IC = -log(p), where p is the probability of observing the term). The representation-based algorithms compute term representations based on embeddings calculated over a specific dataset. Ideally, the dataset should consist of individual (anonymous) patients in order to capture the most granular information. In the absence of this option, it is recommended to simulate such a dataset.
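
As a small worked example of the information content mentioned above, the sketch below estimates p for each term as the fraction of diseases annotated with it and then applies IC = -log(p). The three toy diseases and the helper name are made up, and real implementations usually also count a disease for every ancestor of its annotated terms, which is omitted here.

```python
import math
from collections import Counter

def information_content(disease_terms):
    """Map each term to -log(p), with p the share of diseases annotated with it."""
    n_diseases = len(disease_terms)
    counts = Counter(t for terms in disease_terms.values() for t in set(terms))
    return {t: -math.log(c / n_diseases) for t, c in counts.items()}

# Toy annotations (made-up disease IDs).
toy_annotations = {
    'TOY:1': ['HP:0001250', 'HP:0000083'],
    'TOY:2': ['HP:0001250'],
    'TOY:3': ['HP:0001250', 'HP:0100820'],
}
ic = information_content(toy_annotations)
# A term present in every disease carries no information (IC of 0),
# while a term seen in one of three diseases has IC = log(3).
print(ic)
```

Rare terms therefore weigh more than common ones when two profiles are compared, which is the intuition behind most ontology-based similarity measures.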

Disease prediction from phenotypes only

Disease prediction from phenotypes and genetic data

References and attributions

The following references need to be added:

Orphanet

  • Reference: Pavan S et al. Clinical Practice Guidelines for Rare Diseases: The Orphanet Database. PLoS One. 2017 Jan 18;12(1):e0170365. doi: 10.1371/journal.pone.0170365. PMID: 28099516; PMCID: PMC5242437.
  • Link: https://www.orpha.net/
  • Logo:

HPO

ClinVar

  • Reference: Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, Gu B, Hart J, Hoffman D, Jang W, Karapetyan K, Katz K, Liu C, Maddipatla Z, Malheiro A, McDaniel K, Ovetsky M, Riley G, Zhou G, Holmes JB, Kattman BL, Maglott DR. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018 Jan 4. PubMed PMID: 29165669
  • Link: https://www.ncbi.nlm.nih.gov/clinvar/
  • Logo:
  • Powered by NCBI:

Other sources