Test Orthology Datasets

Summary

This webpage describes the tests performed with the transformed datasets.

The aim of the tests is to show the interoperability between different data sources, and also between different data structures ("flat" ortholog grouping and hierarchical ortholog grouping) for orthology.

Hierarchical ortholog grouping considered:

[Figure: hierarchical ortholog grouping of the genes X1, X2, Y1, Y2 and Z1]

Y1 and Z1 are orthologous to X1
X2 and Y2 are paralogous to X1
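The relations listed above can be read off a hierarchical group by looking at the lowest group that contains both genes. The following is a minimal sketch, assuming a tree in which a duplication node sits above two speciation nodes; the actual tree from the figure is not reproduced here, and the `Gene`, `Group`, and `relation` names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Gene:
    name: str


@dataclass
class Group:
    event: str  # "speciation" (ortholog group) or "duplication" (paralog group)
    children: List[Union["Group", Gene]] = field(default_factory=list)


def gene_names(node):
    """Collect the names of all genes below a node."""
    if isinstance(node, Gene):
        return {node.name}
    names = set()
    for child in node.children:
        names |= gene_names(child)
    return names


def relation(group, a, b):
    """Classify genes a and b by the event at the lowest group containing both."""
    for child in group.children:
        if isinstance(child, Group) and {a, b} <= gene_names(child):
            return relation(child, a, b)
    return "orthologous" if group.event == "speciation" else "paralogous"


# Assumed tree: one duplication, followed by a speciation in each copy.
root = Group("duplication", [
    Group("speciation", [Gene("X1"), Gene("Y1"), Gene("Z1")]),
    Group("speciation", [Gene("X2"), Gene("Y2")]),
])

for other in ("Y1", "Z1", "X2", "Y2"):
    print(other, "is", relation(root, "X1", other), "to X1")
```

A flat grouping corresponds to a single group whose children are all genes, so the same function covers both data structures.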

Datasets used

Inparanoid: This database offers flat, pairwise clusters of orthologs as a [set of OrthoXML files](http://inparanoid.sbc.su.se/download/8.0_current/Orthologs_OrthoXML/) (tens of GB). Each file provides the orthologs for two species.
OMA: This database offers its content as a [single OrthoXML file](http://omabrowser.org/All/oma-groups.orthoXML.xml.gz) (around 870MB).
TreeFam: This database offers hierarchical clusters as a single OrthoXML file (around 640MB).

How we are transforming the data

The transformation process is driven by the mapping OrthoXML2OrthologyOntology, which we have defined between OrthoXML and the [Orthology Ontology](http://purl.org/net/orth).
The transformation is executed using SWIT. We are not using the SWIT web interface, but a command line version.
After the transformation, the RDF content is made available on the MBGD servers (see the Results section).

Note: Although SWIT is able to generate RDF content and to automatically store it in Virtuoso, the tests have been carried out by generating the content in OWL format.
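To illustrate the kind of output the transformation produces, here is a minimal sketch that turns a tiny OrthoXML document into subject-predicate-object triples using only the Python standard library. It is not SWIT and does not reproduce the OrthoXML2OrthologyOntology mapping: the `BASE` URI, the sample document, and the `ParalogsCluster` class name are assumptions, while `OrthologsCluster`, `OrthologyGene`, `hasHomologous`, and `identifier` are taken from the queries in the Results section.

```python
import xml.etree.ElementTree as ET

NS = {"ox": "http://orthoXML.org/2011/"}
ORTH = "http://miuras.inf.um.es/ontologies/orthology.owl#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/orthology/"  # hypothetical dataset prefix

SAMPLE = """<orthoXML xmlns="http://orthoXML.org/2011/" version="0.3" origin="demo" originVersion="1">
  <species name="Homo sapiens" NCBITaxId="9606">
    <database name="demo" version="1">
      <genes><gene id="1" geneId="G1" protId="P1"/></genes>
    </database>
  </species>
  <species name="Mus musculus" NCBITaxId="10090">
    <database name="demo" version="1">
      <genes><gene id="2" geneId="G2" protId="P2"/></genes>
    </database>
  </species>
  <groups>
    <orthologGroup id="c1">
      <geneRef id="1"/>
      <geneRef id="2"/>
    </orthologGroup>
  </groups>
</orthoXML>"""


def walk(group, triples, counter):
    """Emit triples for one group and, recursively, for its nested groups."""
    counter[0] += 1
    uri = f"{BASE}cluster/{group.get('id') or 'n' + str(counter[0])}"
    is_ortholog = group.tag.endswith("orthologGroup")
    # "ParalogsCluster" is an assumed class name, not confirmed by the source.
    kind = "OrthologsCluster" if is_ortholog else "ParalogsCluster"
    triples.append((uri, RDF_TYPE, ORTH + kind))
    for child in group:
        tag = child.tag.split("}")[-1]
        if tag == "geneRef":
            triples.append((uri, ORTH + "hasHomologous", f"{BASE}gene/{child.get('id')}"))
        elif tag in ("orthologGroup", "paralogGroup"):
            triples.append((uri, ORTH + "hasHomologous", walk(child, triples, counter)))
    return uri


def to_triples(xml_text):
    """Return a list of (subject, predicate, object) tuples for an OrthoXML string."""
    root = ET.fromstring(xml_text)
    triples = []
    for gene in root.iterfind(".//ox:gene", NS):
        uri = f"{BASE}gene/{gene.get('id')}"
        triples.append((uri, RDF_TYPE, ORTH + "OrthologyGene"))
        triples.append((uri, ORTH + "identifier", gene.get("geneId")))
    counter = [0]
    for group in root.iterfind("./ox:groups/*", NS):
        walk(group, triples, counter)
    return triples


for triple in to_triples(SAMPLE):
    print(triple)
```

Because nested groups are handled recursively, the same sketch covers both the flat Inparanoid-style clusters and the hierarchical ones.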

Status

Inparanoid: The H. sapiens-M. musculus file has been transformed.
OMA: The complete database has been transformed.
TreeFam: The complete database has been transformed.

Findings

The basic mapping file has proved able to transform the content of the three databases, but a few interesting issues have arisen. Most of the content has been transformed using the base OrthoXML2OO mapping file, but some resource-specific rules have been necessary.

Taxonomic range: The taxonomic range is of interest for analyzing orthology relations, because it defines the taxon at which a relation holds. Unfortunately, the OrthoXML format does not provide a specific tag for this purpose, only a generic "property" tag, which allows any additional property to be defined as a (name, value) pair. OMA and TreeFam use this "property" tag to store the taxonomic range, but in different ways. OMA uses "taxRange" as the name of the "property" and stores the name of the taxon as its value. TreeFam uses "taxon_id" and "taxon_name" for the NCBI ID and the name of the taxon, respectively. This lack of standardization has forced us to define specific mapping rules for these properties. An extension to OrthoXML providing standard tags would make it possible to unify the mapping rules.
Clusters and identifiers: Clusters of orthologs may have an associated id in OrthoXML, which usually has only a local meaning. These identifiers are used to create the URIs of the clusters. This works for resources that provide the whole database in a single file (like OMA or TreeFam), but it generates conflicts when the content is created from multiple files (Inparanoid). Since the SWIT transformation tool allows the structure of the dataset URIs to be defined, in the case of Inparanoid we include in the URI the names of the two species associated with the file.
Gene/Protein identifiers: OrthoXML provides the attributes geneID and protID for the identifiers of genes and proteins. These identifiers should be transformed into external links. Each resource uses identifiers from different databases. The name of the database used is usually provided in OrthoXML as free text (for instance, "Homo sapiens from ENSEMBL v75"), but its URI is not. Moreover, the geneID is not uniform across resources: in some cases (such as Inparanoid) it is a code. The practical implication is that we cannot generate complete links from the content of the OrthoXML file in a generic way, so resource-specific rules are needed. Such a rule would simply create the URI corresponding to the identifier by concatenating the URI of the resource and the identifier. One option would be to use identifiers.org URIs. In that case, if the gene/protein database were included in the OrthoXML file, we could have a common rule for all the resources. SWIT uses the same URI prefix for all the content generated using the basic mapping rules. The SWIT transformation tool provides transformation patterns, which could be used for this purpose as well, but patterns are meant for more complex transformations, so it would be like using a sledgehammer to kill a flea. Unfortunately, the SWIT command line version used for these tests does not permit the use of patterns. We are considering extending the SWIT mapping rules to directly enable the definition of specific URIs for external links, since this would make life easier for SWIT users.
Integration: SWIT can merge data instances by using identity conditions. For instance, we could establish that all protein instances with the same protID value are the same protein, so that their data are merged. This is of particular interest when building a common RDF repository or OWL file from multiple resources. Given that in this test we ran a separate transformation process for each of the three resources and generated the content in different OWL files, using this SWIT option did not make sense. A further task will be to run a transformation pipeline for the three resources that generates the content as RDF in Virtuoso and takes the identity conditions into account. This will also enable us to study the heterogeneity of the IDs used for genes and proteins.
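The resource-specific rules discussed in the findings above can be sketched as three small helpers: one that normalizes the taxonomic range across the OMA and TreeFam conventions, one that builds cluster URIs with an optional species pair (as done for Inparanoid), and one that builds an external link by concatenation. This is only an illustration under stated assumptions, not the SWIT rules themselves: the function names, the `example.org` base URI, and the identifiers.org path used in the usage lines are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

OX = "{http://orthoXML.org/2011/}"  # OrthoXML namespace in ElementTree notation


def taxonomic_range(group):
    """Return (taxon_id, taxon_name) from the <property> children of a group."""
    props = {p.get("name"): p.get("value") for p in group.findall(OX + "property")}
    if "taxRange" in props:  # OMA convention: name only
        return (None, props["taxRange"])
    if "taxon_id" in props or "taxon_name" in props:  # TreeFam convention
        return (props.get("taxon_id"), props.get("taxon_name"))
    return (None, None)


def cluster_uri(base, cluster_id, species_pair=None):
    """Build a cluster URI; the species pair disambiguates ids that are local to a file."""
    parts = [quote(name.replace(" ", "_")) for name in (species_pair or ())]
    parts.append(quote(str(cluster_id)))
    return base.rstrip("/") + "/" + "/".join(parts)


def external_link(resource_uri, identifier):
    """Concatenate the URI of the resource and the identifier."""
    return resource_uri + quote(identifier, safe="")


oma_group = ET.fromstring(
    '<orthologGroup xmlns="http://orthoXML.org/2011/">'
    '<property name="taxRange" value="Mammalia"/></orthologGroup>'
)
treefam_group = ET.fromstring(
    '<orthologGroup xmlns="http://orthoXML.org/2011/">'
    '<property name="taxon_id" value="40674"/>'
    '<property name="taxon_name" value="Mammalia"/></orthologGroup>'
)
print(taxonomic_range(oma_group))
print(taxonomic_range(treefam_group))
print(cluster_uri("http://example.org/inparanoid", 1, ("Homo sapiens", "Mus musculus")))
print(external_link("http://identifiers.org/ensembl/", "ENSG00000139618"))
```

With a standard tag for the taxonomic range and a database URI in the OrthoXML file, the first and third helpers would no longer need to know which resource produced the file.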

Results

Queries and results

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX orthology: <http://miuras.inf.um.es/ontologies/orthology.owl#>

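# Pairs of Homo sapiens and Mus musculus genes that fall under different
# children of a common ortholog cluster, with the resource each pair comes from.
SELECT DISTINCT (STR(?id1) AS ?id1) ?label1 (STR(?id2) AS ?id2) ?label2 ?resource WHERE{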
    ?common_ancestor rdf:type/rdfs:subClassOf* orthology:OrthologsCluster.
    ?common_ancestor orthology:hasHomologous ?cluster_gene1.
    ?common_ancestor orthology:hasHomologous ?cluster_gene2.
    ?common_ancestor void:inDataset ?dataset.
    ?dataset orthology:hasSource ?resource.
    ?cluster_gene1 orthology:hasHomologous* ?gene1.
    ?cluster_gene2 orthology:hasHomologous* ?gene2.
    ?gene1 rdf:type orthology:OrthologyGene.
    ?gene2 rdf:type orthology:OrthologyGene.
    ?gene1 obo:RO_0002162 ?specie1.
    ?gene2 obo:RO_0002162 ?specie2.
    ?gene1 orthology:identifier ?id1.
    ?gene2 orthology:identifier ?id2.
    ?specie1 rdfs:label ?label1.
    ?specie2 rdfs:label ?label2.
    FILTER(?cluster_gene1 != ?cluster_gene2 && ?specie1 != ?specie2 && regex(?label1,'Homo sapiens') && regex(?label2,'Mus musculus')).
}

PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX orthology: <http://miuras.inf.um.es/ontologies/orthology.owl#>

# Same query, restricted to Homo sapiens genes whose identifier matches 'OR2AG1'.
SELECT DISTINCT (STR(?id1) AS ?id1) ?label1 (STR(?id2) AS ?id2) ?label2 ?resource WHERE{
    ?common_ancestor rdf:type/rdfs:subClassOf* orthology:OrthologsCluster.
    ?common_ancestor orthology:hasHomologous ?cluster_gene1.
    ?common_ancestor orthology:hasHomologous ?cluster_gene2.
    ?common_ancestor void:inDataset ?dataset.
    ?dataset orthology:hasSource ?resource.
    ?cluster_gene1 orthology:hasHomologous* ?gene1.
    ?cluster_gene2 orthology:hasHomologous* ?gene2.
    ?gene1 rdf:type orthology:OrthologyGene.
    ?gene2 rdf:type orthology:OrthologyGene.
    ?gene1 obo:RO_0002162 ?specie1.
    ?gene2 obo:RO_0002162 ?specie2.
    ?gene1 orthology:identifier ?id1.
    ?gene2 orthology:identifier ?id2.
    ?specie1 rdfs:label ?label1.
    ?specie2 rdfs:label ?label2.
    FILTER(?cluster_gene1 != ?cluster_gene2 && ?specie1 != ?specie2 && regex(?label1,'Homo sapiens') && regex(?label2,'Mus musculus') && regex(?id1,'OR2AG1')).
}
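The two queries above differ only in the last condition of the FILTER. As a convenience, the following sketch generates either variant from the species names and an optional gene identifier. The `build_query` and `sparql_string` helpers are hypothetical and only assemble the query text; sending it to an endpoint is left out because no endpoint URL is given here.

```python
from string import Template

PREFIXES = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX orthology: <http://miuras.inf.um.es/ontologies/orthology.owl#>

"""

# Query body copied from the queries above; only the FILTER is parameterized.
BODY = Template("""SELECT DISTINCT (STR(?id1) AS ?id1) ?label1 (STR(?id2) AS ?id2) ?label2 ?resource WHERE{
    ?common_ancestor rdf:type/rdfs:subClassOf* orthology:OrthologsCluster.
    ?common_ancestor orthology:hasHomologous ?cluster_gene1.
    ?common_ancestor orthology:hasHomologous ?cluster_gene2.
    ?common_ancestor void:inDataset ?dataset.
    ?dataset orthology:hasSource ?resource.
    ?cluster_gene1 orthology:hasHomologous* ?gene1.
    ?cluster_gene2 orthology:hasHomologous* ?gene2.
    ?gene1 rdf:type orthology:OrthologyGene.
    ?gene2 rdf:type orthology:OrthologyGene.
    ?gene1 obo:RO_0002162 ?specie1.
    ?gene2 obo:RO_0002162 ?specie2.
    ?gene1 orthology:identifier ?id1.
    ?gene2 orthology:identifier ?id2.
    ?specie1 rdfs:label ?label1.
    ?specie2 rdfs:label ?label2.
    FILTER($conditions).
}
""")


def sparql_string(text):
    """Quote text as a single-quoted SPARQL string literal."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_query(species1, species2, gene_id=None):
    """Assemble the ortholog-pair query, optionally restricted to one gene id."""
    conditions = [
        "?cluster_gene1 != ?cluster_gene2",
        "?specie1 != ?specie2",
        f"regex(?label1, {sparql_string(species1)})",
        f"regex(?label2, {sparql_string(species2)})",
    ]
    if gene_id is not None:
        conditions.append(f"regex(?id1, {sparql_string(gene_id)})")
    return PREFIXES + BODY.substitute(conditions=" && ".join(conditions))


print(build_query("Homo sapiens", "Mus musculus", gene_id="OR2AG1"))
```

Note that the species names and the gene identifier are used as regular expressions, as in the original queries, so they match substrings rather than exact values.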
