Introduction

This page is seeded with the use case collection assembled in preparation for the May 2012 Semantics of Biodiversity workshop. All are invited to organize, to add comments and questions, and to add further issues or usage scenarios.

Details

1. Images, media and other linked items (J. Deck / C. Meyer, adapted from BiSciCol use cases) A large marine organism is sampled, including tissues and photographs. Samples are sent for sequencing and entered into GenBank, while selected images are deposited at Morphbank, which adds its own codes. How do we link the physical specimen to the image itself, the image metadata, and the sequence metadata in GenBank? What is the nature of the relationship between the relevant components: the collector, the marine organism itself, what we call this thing (scientific name), the photograph, the photographic metadata, and the sequence metadata? (Bob Morris: Much(?) of this is covered in the impending Audubon Core proposal http://species-id.net/wiki/Audubon_Core )
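
One way to make these relationships concrete is as RDF triples. The sketch below (Python with rdflib) uses invented URIs and an ex: namespace for the linking predicates, since the appropriate standard terms are exactly what this use case asks about:

```python
from rdflib import Graph, Namespace, URIRef, Literal

# All URIs below are placeholders; real ones would be minted by the collection,
# Morphbank, and GenBank. The ex: predicates are hypothetical, not standard terms.
EX = Namespace("http://example.org/vocab/")
g = Graph()
g.bind("ex", EX)

organism = URIRef("http://example.org/organism/whale-001")
tissue   = URIRef("http://example.org/sample/whale-001-t1")
image    = URIRef("http://example.org/morphbank/999999")   # Morphbank's own code
sequence = URIRef("http://example.org/genbank/XX000000")   # GenBank accession record

g.add((tissue,   EX.sampledFrom, organism))   # physical derivation
g.add((image,    EX.depicts,     organism))   # what the photograph shows
g.add((sequence, EX.derivedFrom, tissue))     # sequence metadata points back to the tissue
g.add((organism, EX.collectedBy, Literal("C. Meyer")))  # the collector

print(g.serialize(format="turtle"))
```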

2. The subject of an identification (S. Baskauf, +1 from A. Thomer) This is not really a separate issue from the one above, which is why I put it in brackets. When an Identification (in the sense of a taxonomic determination = applying a scientific name, NOT in the sense of applying a URI) is applied by an agent, what is the class of resource that is being identified? People say they have identified a specimen, but are they really identifying the organism from which the specimen was taken (think of a branch removed from a tree)? What if an expert looks at an image of a flower on a tree and “identifies” it? Are they actually identifying the image, the flower, or the tree? Or are they identifying all of these things at once (the branch, the tree, and the image)? (Bob Morris: it is an incredibly bad idea to assign the same URI to a physical object and to a description of the object (or other digital data with a named relation to the physical object). Thus, as long as there are two URIs, the expert can make assertions about either or both, as she chooses, and there can be no confusion.)

3. Flagging inconsistency/discrepancies in NCBI taxonomy (H. Bik, use case relevant to next-gen sequencing analysis) Because of issues with licensing and reuse associated with other taxonomic nomenclatures (e.g. SILVA), we continue to rely on NCBI taxonomy for annotating environmental sequence data (e.g. large 454, Illumina datasets). However, NCBI taxonomy is extremely messy and not concordant with phylogenetic structure for many groups. For example, in nematodes the NCBI hierarchy is more consistent with historical morphological classifications than with the most recent knowledge about evolutionary relationships inferred from molecular phylogenies. How do we link taxonomic information from different sources (e.g. TreeBase, SILVA, greengenes, the OpenTree project) and flag inconsistencies in NCBI’s hierarchy?
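
A toy sketch of the flagging step, in plain Python; the lineages are illustrative placeholders rather than real NCBI or SILVA records:

```python
# Toy lineages keyed by rank; real data would come from NCBI Taxonomy,
# SILVA, TreeBASE, etc. via their dumps or APIs.
ncbi_lineage  = {"phylum": "Nematoda", "class": "Chromadorea", "order": "Rhabditida"}
other_lineage = {"phylum": "Nematoda", "class": "Chromadorea", "order": "Plectida"}

def flag_inconsistencies(a, b):
    """Return the ranks at which two lineage assignments disagree."""
    return {rank: (a[rank], b[rank])
            for rank in a.keys() & b.keys()
            if a[rank] != b[rank]}

print(flag_inconsistencies(ncbi_lineage, other_lineage))
# {'order': ('Rhabditida', 'Plectida')}
```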

4. Tracking down experimentally validated versus sequence-homology based taxonomic and functional annotations (J. Gilbert, use case relevant for EMP taxonomic and functional annotation) Whether you can ‘trust’ an annotation for a nucleotide sequence fragment is of vital importance to massive-scale efforts such as the EMP. Hence tracking nomenclature back to the original sequence that was experimentally validated is of vital import. Enabling tracking of citation, annotation distance (how likely it is that a sequence was annotated by an experimentally validated homolog), and standard sequence similarity metrics is of vital interest for cleaning up annotations and providing appropriate metrics.

5. Morphological and ecological observations of organisms in the field, and for taxa (C. Webb) How do we represent collection label observations (“flowers yellow, growing on rocks”), and link these observations to individual organisms and specimens? How do we record general phenotypes for taxa? The EQ (entity-quality) model has many advantages but has yet to be formalized for the semantic web. One approach is to make an oboe:Observation of a pato:Quality rel:inheres_in an obo:AnatomicalStructure which is rel:part_of a dsw:IndividualOrganism; the oboe:Observation also serves as a dsw:Token which is dsw:evidenceFor a dwc:Occurrence of the same dsw:IndividualOrganism. But there are many other ways of solving this. See, e.g., the TDWG 2011 poster of Balhoff et al.
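
A rough rdflib rendering of the pattern described above; the namespace IRIs are placeholders, and the exact term spellings would need to be checked against the published oboe, pato, ro, dsw, and dwc vocabularies:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

# Placeholder namespace IRIs; substitute the published vocabulary IRIs.
OBOE = Namespace("http://example.org/oboe#")
PATO = Namespace("http://example.org/pato#")
REL  = Namespace("http://example.org/ro#")
DSW  = Namespace("http://example.org/dsw#")
DWC  = Namespace("http://rs.tdwg.org/dwc/terms/")
EX   = Namespace("http://example.org/data/")

g = Graph()
g.add((EX.obs1, RDF.type, OBOE.Observation))
g.add((EX.obs1, OBOE.ofEntity, EX.quality1))            # an observation of the quality
g.add((EX.quality1, RDF.type, PATO.yellow))             # Q = yellow (placeholder class)
g.add((EX.quality1, REL.inheres_in, EX.flower1))        # the quality inheres in the flower
g.add((EX.flower1, REL.part_of, EX.organism1))          # the flower is part of the organism
g.add((EX.organism1, RDF.type, DSW.IndividualOrganism))
g.add((EX.obs1, RDF.type, DSW.Token))                   # the observation as evidence
g.add((EX.obs1, DSW.evidenceFor, EX.occurrence1))
g.add((EX.occurrence1, RDF.type, DWC.Occurrence))
g.add((EX.occurrence1, DSW.occurrenceOf, EX.organism1)) # hypothetical linking term
```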

6. Expressive semantics to inter-relate observations relevant for understanding biodiversity patterns in nature (M. Schildhauer) This is somewhat similar to Webb’s use case above, but involves thinking about the large numbers of ecological data sets that include references to an occurrence (e.g. a taxon name) in the field, but are also associated with various other bio and earth science measurements, e.g. the size (biomass|wing span) of that specimen, or the context in which it is situated (potentially another taxon, for example some “Bird.sp” nesting in some “Tree.sp”; or occurring in some second-growth forest, or situated at XX lat/YY lon, etc). How do we clarify in a tuple how these attributes are inter-related, which represent normative measurements (mean values) as opposed to actual in situ measurements, or a mix of these, and which in situ measurements pertain to the same individual/specimen (e.g. when several morphometric measurements are in a tuple that describes several species’ interactions)? The OBOE model offers one way to clarify these semantics, and is potentially compatible with the ISO O&M standard as well. I mention “Expressive” in the title to indicate that we might also want to consider what reasoning capabilities we hope to derive from these semantic investments.
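
In the same spirit, a hedged sketch of how an OBOE-style observation might keep several measurements tied to the right individuals within one tuple, and mark a value as in situ rather than a mean; the IRIs and the measurementBasis flag are invented for illustration:

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

OBOE = Namespace("http://example.org/oboe#")   # placeholder IRI for the OBOE model
EX   = Namespace("http://example.org/data/")

g = Graph()
# One observation of the bird individual; its measurements stay attached to it.
g.add((EX.obsBird, RDF.type, OBOE.Observation))
g.add((EX.obsBird, OBOE.ofEntity, EX.bird42))
g.add((EX.obsBird, OBOE.hasMeasurement, EX.m_wingspan))
g.add((EX.m_wingspan, OBOE.ofCharacteristic, EX.WingSpan))
g.add((EX.m_wingspan, OBOE.hasValue, Literal("21.5", datatype=XSD.decimal)))
# The tree the bird is nesting in is a separate observation that provides context.
g.add((EX.obsTree, RDF.type, OBOE.Observation))
g.add((EX.obsTree, OBOE.ofEntity, EX.tree7))
g.add((EX.obsBird, OBOE.hasContext, EX.obsTree))
# A hypothetical flag distinguishing an in situ value from a taxon-level mean.
g.add((EX.m_wingspan, EX.measurementBasis, EX.InSituMeasurement))
```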

7. Clarifying what is meant by an annotation and tracking physical and digital annotations (sensu lato) through time (G. Nelson, scenario abstracted from a herbarium digitization project) Traditional meanings of annotation are taxonomic in nature and usually refer to the process of recording successive determinations. However, specimens, especially herbarium sheets, often include additional markings and notes that should also be linked to specimen data and media, such as:

  • taxonomic determinations subsequent to those of the collector, attributed or unattributed,
  • changes in higher taxonomic rank (usually family), which are inconsistent with the rank applied to the label at time of collection,
  • subsequent notes highlighting important morphological features, which may be missing or present on the specimen,
  • historical notes from the collector, separate from the determination label, detailing the collecting event and/or its significance,
  • historical notes generated from subsequent visits to a collecting site, noting persistence, abundance, destruction, other,
  • subsequent latitude/longitude georeferences, appended by the collector or other georeferencer,
  • noted uses of the specimen in papers, reports, monographs, or for DNA extraction,
  • subsequent addition of measurements.

Example (numbers in parentheses denote data/media elements): A plant specimen is collected in 1921 and, when mounted, is determined (1) as species A B. A note (2) affixed to the sheet at the time of preparation, but separate from the label, includes details of the collecting event not necessarily related to the occurrence. An annotation label (3) appended to the sheet in 1949 transfers the species to the newly erected genus C, resulting in a determination to species C B (the species name under which the sheet is currently filed). A note (4) appended to the sheet in 1991 denotes that the specimen was reviewed during a generic revision of genus C. Undated markings and handwritten notes (5) on the sheet, presumably made at the time of the 1991 note, call attention with arrows and handwritten explanations to exemplary morphological features clearly evident on the sheet. The folder which contains the sheet is stored (6) within a family other than that noted on the original label, but without comment, presumably the result of classification changes by APG or Flora of North America. In 2011, the specimen is imaged (7) and the specimen data captured, with both served online via an institutional web portal and an associated portal of a collaborating institution, resulting in a single database record and two images in separate repositories (8). In 2012 an image and data record from the originating institution are contributed to iDigBio (9) and a DwC-A independently registered at GBIF (10). Later in 2012 an electronic annotation (11) of this specimen is created through the institution’s web portal, identifying the organism represented by the image as species C D. Even later in 2012, the record is georeferenced (13) via a crowdsourcing project at the original institution, with the georeferenced data stored in the database serving the institution’s portal. How do we treat each of these additions to the physical specimen and/or digital record, and how do we ensure that data are consistent and linked across all versions of this record?

8. Retrieving Environmental Data based on the Environmental Conditions / Contextualisation at the time of sampling (N. Morrison) Access to data associated with biological specimens, samples and observations (SSOs) can be achieved through a number of different routes. Examples include an SSO’s molecular signature (i.e. its DNA sequence, protein sequence, RNA sequence, etc.), geographic coordinates, phenotypic characteristics, timestamp, or the characteristics of its environment. Descriptions of an SSO’s environment are rarely well controlled and are difficult to rely upon for data access or comparison. However, these descriptions are a vital source of contextual information, particularly when exact measurements of environmental parameters are not available.

9. Mapping Sampling Design onto an Ontology (H. Schentz): https://www.umweltbundesamt.at/fileadmin/site/daten/Ontologien/SERONTO/SERONTOCore20090205.owl

I do not think that it makes sense to bring in the whole ontology, but I could bring in the ideas or the rough conceptual model of the sampling description: it is mostly about the sampling design (how do samples relate to each other?).

10. Bridging Controlled Vocabularies (H. Schentz):

https://secure.umweltbundesamt.at/envThes/en/hierarchical_concepts.html

I think that the interlinkage of such controlled vocabularies (with GEMET, AGROVOC, EARTh) can at some point serve as an example of how linked data can help to bridge the vocabularies and taxonomies of differing groups.
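
As a small illustration of that kind of bridging, SKOS mapping properties can relate concepts across thesauri; the concept URIs below are invented placeholders, not real EnvThes, GEMET, or AGROVOC identifiers:

```python
from rdflib import Graph, URIRef
from rdflib.namespace import SKOS

g = Graph()
envthes = URIRef("http://example.org/envThes/concept/soilMoisture")    # placeholder
gemet   = URIRef("http://example.org/gemet/concept/soil-moisture")     # placeholder
agrovoc = URIRef("http://example.org/agrovoc/concept/c_soil_moisture") # placeholder

g.add((envthes, SKOS.exactMatch, gemet))    # same meaning in both thesauri
g.add((envthes, SKOS.closeMatch, agrovoc))  # similar but not identical concept
print(g.serialize(format="turtle"))
```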

11. Habitat vs. environment/environmental material (J. Deck) Reef, water, and coral may all be suitable descriptions. How do we reconcile habitat descriptors for microbes (e.g. ocean water) with common habitat descriptors for macro-biota (e.g. inner reef)?

12. Best practices for collaborative management of RDF vocabularies (D. Endresen) Strategies to maximize reuse of terms and avoid re-declaration of the same semantic meaning in multiple vocabulary terms. Best practices to manage collaborative development and maintenance of RDF vocabularies. Lowering the barrier for participation and domain expert (non-technical) contributions using Semantic MediaWiki as a platform? See also: http://community.gbif.org/pg/groups/21382/vocabulary-management/

13. Using Darwin Core to capture all elements of name/taxon dataset exchange Does the current DwC vocabulary enable all of the data elements needed for a full exchange of checklist records, or a standard nomenclatural list? There appears to be a disconnect between “name” data used for specimen/observation/occurrence identification datasets, and “name” data used for checklist/taxonomic relationship datasets. Are these two use cases different enough to warrant extended/added vocabulary?

14. Not challenging in principle, but would this provide useful leverage? (John Kunze) How about assigning unique persistent prefixes to museums, so that app developers could then append the museum's own unique specimen ids to create globally unique and persistent specimen ids across all the museums? Identifier management services (e.g., EZID) and resolvers (e.g., n2t.net) exist for this kind of thing.
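
A minimal sketch of that scheme, with made-up prefixes; in practice the prefix would be registered with a service such as EZID and resolved through n2t.net:

```python
from urllib.parse import quote

# Hypothetical registry mapping museum codes to registered prefixes.
MUSEUM_PREFIXES = {
    "MVZ": "ark:/99999/mvz",   # placeholder prefix, not a real ARK shoulder
    "KU":  "ark:/99999/ku",
}

def specimen_id(museum_code, local_id, resolver="https://n2t.net/"):
    """Append the museum's own specimen id to its registered prefix."""
    prefix = MUSEUM_PREFIXES[museum_code]
    return resolver + prefix + "/" + quote(str(local_id), safe="")

print(specimen_id("MVZ", "Herp:12345"))
# https://n2t.net/ark:/99999/mvz/Herp%3A12345
```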

15. Making the leap: from vocabularies to ontologies, from spreadsheets to linked data, from specifications to implementations (A. Matsunaga) A number of very specific vocabularies have been created and are either ratified biodiversity standards or are in the process of being ratified by an organization (Dublin Core terms, Darwin Core terms, Audubon Core terms, ABCD). Structural “standards” also exist (EML, ABCD, ABCDEFG, ABCDDNA, KML, DwC-A), but the data management tools developed do not always keep up with the standards. How can we move from defining terms in a vocabulary to creating ontologies capable of expressing structure? It is not yet clear to me whether BFO works, but features that were presented or discussed and that I would like to see implemented in an ontology for biodiversity are: (a) clear, precise definitions with examples (e.g., define Taxon), (b) structure and relationships with definitions, (c) higher concepts for divide and conquer, minimizing silos, (d) ontology versioning, (e) ontology reuse, and (f) killing bad ontologies. How can we move from managing data in spreadsheets to relational structures for concepts with durable GUIDs? iDigBio issued “Guidelines for Managing Persistent Identifiers”, which explains the importance of the durability of unique identifiers for objects that change properties over time and proposes the adoption of a URI scheme for creating globally unique identifiers. Even though it is tempting to use spreadsheets for managing data for several reasons, it takes some skill to ensure proper management of persistent identifiers. The biodiversity community therefore needs to move on to data management systems that are able to preserve the durability of identifiers and establish relationships. Having a clear definition of objects still depends on the previous question. How can we move from having specifications to tools that implement those standards? What about when the standards fall short? What tools are necessary to allow two-way communication?
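
A hedged sketch of the spreadsheet-to-linked-data step: each row gets a durable URI minted from an institutional prefix plus its catalog number, and columns map onto Darwin Core terms (the base URI and the column-to-term mapping are illustrative):

```python
import csv, io
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")
BASE = "http://example.org/specimen/"   # placeholder institutional prefix

# Inline stand-in for a spreadsheet export.
rows = io.StringIO(
    "catalogNumber,scientificName,eventDate\n"
    "12345,Quercus alba,1921-06-15\n"
)

g = Graph()
for row in csv.DictReader(rows):
    uri = URIRef(BASE + row["catalogNumber"])   # durable GUID for the record
    g.add((uri, RDF.type, DWC.Occurrence))
    g.add((uri, DWC.catalogNumber,  Literal(row["catalogNumber"])))
    g.add((uri, DWC.scientificName, Literal(row["scientificName"])))
    g.add((uri, DWC.eventDate,      Literal(row["eventDate"])))

print(g.serialize(format="turtle"))
```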

16. What does our RDF Graph look like? (J. Sachs) What does our data look like? What design patterns do we use in our ontologies? How tractable are they, especially in a distributed environment? How easy is it to make mistakes? What are the minimum changes necessary to transform http://rs.tdwg.org/dwc/rdf/dwcterms.rdf into a foundation for a decent semantic web for biodiversity informatics? I'm particularly interested in understanding these questions in the context of citizen science.

17. Modeling Invasiveness (J. Sachs) There was (at least one) long discussion about this on tdwg-content, beginning around here. Questions that came up include: "How should we structure the controlled vocabularies/ontologies which will be used to value the dwc:establishmentMeans property?" and "What entities (observations or taxa) should be considered as invasive?"
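
One possible structure is a small SKOS vocabulary whose concept URIs value dwc:establishmentMeans, sketched below with a made-up vocabulary namespace. Note that the sketch attaches the value to an occurrence rather than to a taxon; the second question above is precisely whether that is the right subject.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, SKOS

DWC   = Namespace("http://rs.tdwg.org/dwc/terms/")
EMVOC = Namespace("http://example.org/establishmentMeans/")  # hypothetical vocabulary
EX    = Namespace("http://example.org/data/")

g = Graph()
g.add((EMVOC.invasive, RDF.type, SKOS.Concept))               # controlled value as a concept
g.add((EX.occurrence99, RDF.type, DWC.Occurrence))
g.add((EX.occurrence99, DWC.establishmentMeans, EMVOC.invasive))
```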

18. Annotating data/Filtered Push (J. Sachs, on behalf of the FP/AOD folks, with additions by Bob Morris). Gil Nelson's example and Jack Gilbert's use case (both above) can be used to motivate several aspects of the Data Annotation Ontology (AOD). Examples of using AOD in contexts motivated by the experiences of Filtered Push can be found here. AOD is being merged into the Annotation Ontology derivation of Annotea, which in turn is being merged with the Open Annotation Consortium effort, an integrated effort at http://www.w3.org/community/openannotation/. The expectation is that this effort will lead to a W3C Recommendation for the annotation of digital objects.
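
As a rough illustration, an electronic determination like annotation (11) in Nelson's example might be expressed in the Open Annotation model roughly as below; the property names follow the community draft of the time and may change in a final recommendation, and the data URIs are placeholders:

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

OA = Namespace("http://www.w3.org/ns/oa#")
EX = Namespace("http://example.org/")

g = Graph()
annotation = EX.anno11   # the 2012 electronic determination in Nelson's example
g.add((annotation, RDF.type, OA.Annotation))
g.add((annotation, OA.hasTarget, EX["specimen/12345"]))           # the specimen record/image
g.add((annotation, OA.hasBody, EX["determination/species-C-D"]))  # the new determination
g.add((annotation, OA.annotatedBy, EX["agent/j-expert"]))
g.add((annotation, OA.annotatedAt, Literal("2012-08-01", datatype=XSD.date)))
```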

19. Data fusion and integration across same/different observatories of a given quality that are coincident for a given spatial-temporal footprint (B. Wee) Define “data fusion” as combining the “same” (= scientifically “equivalent”) type of data across multiple sources (e.g. CO2 flux measurements of a certain given quality across different environmental observatories, say somebody trying to fuse data from DOE’s Ameriflux and NEON). Define “data integration” as doing correlational / data-model-assimilation analysis using “different” types of data across the same or different data sources (e.g. airborne lidar acquisitions used to estimate canopy biomass over a given area, correlated with biomass estimated from ground-based techniques, where the two types of data may come from different observation systems and therefore be of different quality). We foresee many cases where scientists will need to do data fusion and data integration using data published by different credible data sources. How does one develop an ontology that would enable a search mechanism to allow exploration / discovery of data sets that meet a user’s requirements along the following axes? The data must: (1) be of a certain given quality (e.g. via ISO 19138 / ISO 19157 specifications), (2) be within a given spatial constraint (e.g. “within the footprint or in the vicinity of the XYZ LTER site”), (3) be within a given temporal constraint (easy to implement, not quite a challenge), and (4) be of a certain measurement type (e.g. precipitation is a way to describe both rainfall and snow, but there is a trickier case of nitrogen species being described in various ways by various communities). I would like to call special attention to data quality descriptors, given that scientists may wish to work along a spectrum spanning “professionally” collected and processed data (e.g. NEON) and data generated by sources that have less control over QA/QC because of various constraints.
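
A sketch of what a discovery query along those axes might look like; every dataset-description property used here (ex:measurementType, ex:qualityLevel, ex:startYear, and so on) is hypothetical, and only the SPARQL machinery itself is standard:

```python
from rdflib import Graph

# The graph would be loaded with dataset-level descriptions from each observatory.
g = Graph()

query = """
PREFIX ex: <http://example.org/vocab/>
SELECT ?dataset WHERE {
  ?dataset ex:measurementType ex:CO2Flux ;   # axis 4: measurement type
           ex:qualityLevel    ?q ;           # axis 1: data quality descriptor
           ex:startYear       ?startYear ;   # axis 3: temporal constraint
           ex:endYear         ?endYear ;
           ex:siteLatitude    ?lat ;         # axis 2: crude spatial constraint
           ex:siteLongitude   ?lon .
  FILTER (?q >= 3)
  FILTER (?startYear <= 2011 && ?endYear >= 2010)
  FILTER (?lat > 39.0 && ?lat < 41.0 && ?lon > -106.0 && ?lon < -104.0)
}
"""
for row in g.query(query):
    print(row.dataset)
```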

20. Species and Status and Observation (by Olivier Rovellotti) It would be nice to be able to provide fieldworkers with a list of taxa of interest when surveying a specific area, either by linking to IUCN/DBpedia or by building triples of local status with their geographic scope. Spatial RDF/SPARQL is crucial to most of the use cases we are confronting in the real world.