The goal of taxastand is to standardize species names from different
sources, a common task in biology.
Very often different biologists use different synonyms to refer to the
same species. If we want to join data from different sources, their
taxonomic names must be standardized first. This is what taxastand
seeks to do in a reproducible and efficient manner.
This package is in early development. There may be major, breaking changes to functionality in the near future. If you use this package, I highly recommend using a package manager like renv so that later updates won’t break your code.
taxastand is based on matching names to a single taxonomic
standard, that is, a database of accepted names and synonyms. As long
as a single taxonomic standard is used, we can confidently resolve names
from disparate sources.
The taxonomic standard must conform to Darwin Core standards. The user must provide this database (as a data frame). There are many sources of taxonomic data online, including GBIF, Catalogue of Life, and ITIS, to name a few. The taxadb package provides convenient functions for downloading various taxonomic databases that use Darwin Core.
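As a rough illustration of what such a data frame can look like, the sketch below builds a tiny reference taxonomy by hand, using the four Darwin Core columns that appear in the example later in this README. The rows are copied from the first lines of filmy_taxonomy; the object name ref_taxonomy and the consistency check are illustrative additions and not part of taxastand.

```r
# A hand-built reference taxonomy with four Darwin Core columns.
# Rows are taken from the head of the filmy_taxonomy example dataset.
ref_taxonomy <- data.frame(
  taxonID = c(54115097, 54133783, 54115098, 54133784),
  acceptedNameUsageID = c(NA, 54115097, NA, 54115098),
  taxonomicStatus = c("accepted name", "synonym", "accepted name", "synonym"),
  scientificName = c(
    "Cephalomanes crassum (Copel.) M. G. Price",
    "Trichomanes crassum Copel.",
    "Cephalomanes densinervium (Copel.) Copel.",
    "Trichomanes densinervium Copel."
  ),
  stringsAsFactors = FALSE
)

# Sanity check (my own addition): every synonym should point to a taxonID
# that is present in the table and marked as an accepted name.
synonyms <- ref_taxonomy[ref_taxonomy$taxonomicStatus == "synonym", ]
accepted_ids <- ref_taxonomy$taxonID[ref_taxonomy$taxonomicStatus == "accepted name"]
all_synonyms_resolve <- all(synonyms$acceptedNameUsageID %in% accepted_ids)
all_synonyms_resolve
```

A check like this is cheap to run before name resolution and catches synonyms whose accepted name is missing from the standard.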
taxastand can be installed from r-universe or GitHub.
install.packages("taxastand", repos = 'https://joelnitta.r-universe.dev')
OR
# install.packages("remotes")
remotes::install_github("joelnitta/taxastand")
taxastand depends on taxon-tools for taxonomic name matching.
There are two options for using this dependency.
- Install docker and set docker = TRUE when using taxastand functions.
OR
- Install the two programs included in taxon-tools, parsenames and matchnames.
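The sketch below shows one way to choose between the two options at run time. It assumes that parsenames and matchnames would be found on the PATH if installed, and that ts_resolve_names() accepts the docker argument described above; the variable names are hypothetical, and the PATH check is my own addition rather than something taxastand does for you.

```r
# Check whether the two taxon-tools programs are on the PATH.
# Sys.which() returns "" for any program it cannot find.
tools <- Sys.which(c("parsenames", "matchnames"))
have_taxon_tools <- all(nzchar(tools))

# Fall back to docker only if the local programs are missing.
have_docker <- nzchar(Sys.which("docker"))
use_docker <- !have_taxon_tools

# Run the match only if taxastand and at least one backend are available.
if (requireNamespace("taxastand", quietly = TRUE) &&
    (have_taxon_tools || have_docker)) {
  data("filmy_taxonomy", package = "taxastand")
  res <- taxastand::ts_resolve_names(
    "Gonocormus minutum", filmy_taxonomy, docker = use_docker
  )
  print(res)
}
```

Note that finding the docker executable does not guarantee the docker daemon is running, so the call can still fail in that case.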
- ROpenSci has a task view summarizing many tools available for taxonomy.
- taxize is the “granddaddy” of taxonomy packages in R. It can search around 20 different taxonomic databases for names and retrieve taxonomic information.
- TNRS, the Taxonomic Name Resolution Service, is a web application that resolves taxonomic names of plants according to one of six databases.
- taxizedb downloads taxonomic databases and provides tools to interface with them through SQL.
- taxadb also downloads and searches taxonomic databases. It can interface with them either through SQL or in-memory in R.
- taxonstand has a very similar goal to taxastand, but only uses The Plant List (TPL) as its taxonomic standard and does not allow the user to provide their own. Note that TPL is no longer being updated as of 2013.
Although existing web-based solutions for taxonomic name resolution are very useful, they may not be ideal for all situations: the choice of reference database to use for standardization is limited, they may not be able to handle very large queries, and the user has no guarantee that the same input will yield the same output at a later date due to changes in the remote database.
Furthermore, matching of taxonomic names is not straightforward, since they are complex data structures including multiple components (e.g., genus, specific epithet, basionym author, combination author, etc.). Of the tools mentioned above, only TNRS can fuzzily match taxonomic names based on their parsed components, but it does not allow for use of a local reference database.
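To make the idea of matching on parsed components concrete, here is a minimal base R sketch. It is not how taxon-tools works internally: parse_name() is a hypothetical helper that only handles simple binomials, and adist() (edit distance from base R) stands in for a real fuzzy matcher.

```r
# Hypothetical helper: split a simple binomial into genus, specific
# epithet, and author string. Real names (hybrids, infraspecific ranks,
# etc.) need a proper parser such as parsenames from taxon-tools.
parse_name <- function(x) {
  m <- regmatches(x, regexec("^(\\S+)\\s+(\\S+)\\s*(.*)$", x))[[1]]
  list(genus = m[2], epithet = m[3], author = m[4])
}

query <- parse_name("Gonocormus minutum")
ref <- parse_name("Gonocormus minutus (Bl.) Bosch")

# Edit distance per component
dist_genus <- drop(adist(query$genus, ref$genus))
dist_epithet <- drop(adist(query$epithet, ref$epithet))

# Edit distance on the raw strings, for comparison
dist_whole <- drop(adist("Gonocormus minutum", "Gonocormus minutus (Bl.) Bosch"))

c(genus = dist_genus, epithet = dist_epithet, whole = dist_whole)
```

Comparing component by component shows that the query differs from the reference only by a single character in the epithet, whereas the raw string distance is inflated by the author that the query simply omits.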
The motivation for taxastand is to provide greater flexibility and
reproducibility by allowing for complete version control of the code and
database used for name resolution, while implementing fuzzy matching of
parsed taxonomic names.
Here is an example of fuzzy matching followed by resolution of synonyms using the dataset included with the package.
library(taxastand)
# Load example reference taxonomy in Darwin Core format
data(filmy_taxonomy)
# Take a look at the columns used by taxastand
head(filmy_taxonomy[c(
"taxonID", "acceptedNameUsageID", "taxonomicStatus", "scientificName")])
#>    taxonID acceptedNameUsageID taxonomicStatus
#> 1 54115096                  NA   accepted name
#> 2 54133783            54115097         synonym
#> 3 54115097                  NA   accepted name
#> 4 54133784            54115098         synonym
#> 5 54115098                  NA   accepted name
#> 6 54133785            54115099         synonym
#>                              scientificName
#> 1             Cephalomanes atrovirens Presl
#> 2                Trichomanes crassum Copel.
#> 3 Cephalomanes crassum (Copel.) M. G. Price
#> 4           Trichomanes densinervium Copel.
#> 5 Cephalomanes densinervium (Copel.) Copel.
#> 6         Trichomanes infundibulare Alderw.
# As a test, resolve a misspelled name
ts_resolve_names("Gonocormus minutum", filmy_taxonomy)
#>                query                        resolved_name
#> 1 Gonocormus minutum Crepidomanes minutum (Bl.) K. Iwats.
#>                     matched_name resolved_status matched_status match_type
#> 1 Gonocormus minutus (Bl.) Bosch   accepted name        synonym auto_fuzzy
# We can now use the `resolved_name` column of this result for downstream
# analyses joining on other datasets that have been resolved to the same
# reference taxonomy.
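The downstream join mentioned in the example could look something like the sketch below. The two data frames and their record_id columns are entirely hypothetical placeholders; in practice each would hold the output of ts_resolve_names() for a different data source, resolved against the same reference taxonomy.

```r
# Hypothetical datasets that have already been resolved to the same
# reference taxonomy (the record IDs are placeholders, not real data).
source_a <- data.frame(
  record_id_a = c("a1", "a2"),
  resolved_name = c(
    "Crepidomanes minutum (Bl.) K. Iwats.",
    "Cephalomanes atrovirens Presl"
  ),
  stringsAsFactors = FALSE
)
source_b <- data.frame(
  record_id_b = c("b1", "b2"),
  resolved_name = c(
    "Crepidomanes minutum (Bl.) K. Iwats.",
    "Cephalomanes crassum (Copel.) M. G. Price"
  ),
  stringsAsFactors = FALSE
)

# Inner join on the standardized name: only species present in both
# sources are kept.
joined <- merge(source_a, source_b, by = "resolved_name")
joined
```

Because both sources use the same resolved names, a plain merge() is enough; no further fuzzy matching is needed at this stage.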
If you use this package, please cite it! Here is an example:
Nitta, JH (2021) taxastand: Taxonomic name standardization in R. https://doi.org/10.5281/zenodo.5726390
The example DOI above is for the overall package.
Here is the latest DOI, which you should use if you are using the latest version of the package:
You can find DOIs for older versions by viewing the “Releases” menu on the right.
You should also cite the software that taxastand relies on, taxon-tools: https://github.com/camwebb/taxon-tools