An R package for connecting to chemical and biological databases.
biodb is a framework for developing database connectors. It ships with a few local (non-remote) connectors, for CSV files and SQLite databases, but its main purpose is to ease the development of your own connectors. Some connectors are already available in other packages on GitHub (e.g. biodbChebi, biodbHmdb, biodbKegg, biodbLipidmaps, biodbUniprot). For now, the targeted databases are those that store molecules, proteins, lipids and MS spectra. However, other types of databases (NMR databases, for instance) could also be targeted.
With biodb you can:
- Define your own database connector.
- Access entries by accession number and let biodb download them for you.
- Take advantage of the cache system, which saves the results of all requests sent. If you send the same request again, the cached result is used instead of contacting the database. The cache system can be disabled.
- Download a full database locally, when the database allows it, and then access its entries by accession number locally.
- Rely on biodb to access the database correctly, respecting its published access policy (e.g. not sending too many requests). biodb uses a dedicated class for scheduling requests on each database.
- Switch from one database to another easily (provided they offer the same type of information) without changing a line of your code. This works because entries are populated with values from the database using the same keys every time.
- Search for MS and MSMS spectra by peaks in mass spectra databases.
- Export any database into a CSV file or record it into an SQLite file.
Install the latest stable version using Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install('biodb')
You can install the latest development version of biodb from GitHub:
install.packages('devtools')
devtools::install_github('pkrog/biodb', dependencies=TRUE)
Alongside biodb you can install the following R extension packages that use biodb for implementing connectors to online databases:
- biodbChebi for accessing the ChEBI database.
- biodbHmdb for accessing the HMDB database.
- biodbKegg for accessing the KEGG databases.
- biodbUniprot for accessing the UniProt database.
You can install any of these extension packages with the following command (replace 'biodbKegg' with the name of the package you want):
devtools::install_github('pkrog/biodbKegg', dependencies=TRUE)
biodb is part of Bioconda, so you can also install it using Conda. This also means that it can be installed automatically in Galaxy for a tool, if the Conda system is enabled.
The biodb package contains the following in-house database connectors:
- Compound CSV File (an in-house database stored inside a CSV file).
- Mass CSV File (an in-house database stored inside a CSV file).
- Mass SQLite (an in-house database stored inside an SQLite file).
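As an illustration, here is a minimal sketch of opening a Compound CSV File connector and listing a few entries. It assumes that the connector identifier is 'comp.csv.file', that it accepts a `url` parameter pointing to the file, and that a sample file named chebi_extract.tsv is shipped in the package's extdata folder; these three points are assumptions to check against the package documentation for your installed version.

```r
# Sketch only: connector ID, `url` parameter and sample file name are assumptions.
mybiodb <- biodb::newInst()

# Locate a sample compound file assumed to ship with biodb.
compUrl <- system.file("extdata", "chebi_extract.tsv", package = "biodb")
stopifnot(nzchar(compUrl))  # Stop early if the sample file is not found.

# Create the connector on the local file.
conn <- mybiodb$getFactory()$createConn("comp.csv.file", url = compUrl)

# Retrieve the first few entries and convert them into a data frame.
ids <- conn$getEntryIds()
entries <- conn$getEntry(head(ids, 3))
df <- mybiodb$entriesToDataframe(entries)
print(df)

# Release the biodb instance once finished.
mybiodb$terminate()
```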
Here are some of the fields accessible through the retrieved entries (more fields are defined in extension packages):
- Chemical formula.
- InChI.
- InChI Key.
- SMILES.
- Common names and IUPAC names.
- Charge.
- Average mass.
- Monoisotopic mass.
- Molecular mass.
- MS device.
- MS Level.
- MS mode.
- MS precursor M/Z.
- MS precursor annotation.
- Peaks' M/Z values.
- Peaks' intensities.
- Peaks' relative intensities.
- Attributions of peaks.
- Compositions of peaks.
- Peak table.
- Chromatographic column name.
- Chromatographic column length.
- Chromatographic column diameter.
- Chromatographic solvent.
- Chromatographic retention time.
- Chromatographic retention time unit.
Here is an example of how to retrieve entries from the ChEBI database and convert them into a data frame (you must first install both the biodb and biodbChebi packages):
bdb <- biodb::newInst()
chebi <- bdb$getFactory()$createConn('chebi')
entries <- chebi$getEntry(c('2528', '7799', '15440'))
bdb$entriesToDataframe(entries)
All compound databases (ChEBI, Compound CSV File, KEGG Compound, ...) can be searched for compounds using the same function. Once you have your connector instance, you just have to call searchCompound() on it:
myconn$searchCompound(name='phosphate')
The function will return a character vector containing all identifiers of matching entries.
It is also possible to search by mass, choosing the mass field you want (provided this particular mass field is handled by the database):
myconn$searchCompound(mass=230.02, mass.field='monoisotopic.mass', mass.tol=0.01)
Searching by both name and mass is also possible:
myconn$searchCompound(name='phosphate', mass=230.02, mass.field='monoisotopic.mass', mass.tol=0.01)
All mass spectra databases (Mass CSV File and Mass SQLite) can be searched for mass spectra using the same function, searchMsEntries():
myconn$searchMsEntries(mz.min=40, mz.max=41)
The function will return a character vector containing all identifiers of matching entries (i.e.: spectra containing at least one peak inside this M/Z range).
Annotating a mass spectrum can be done either using a mass spectra database or a compound database.
When using a mass spectra database, the function to call is searchMsPeaks():
myMassConn$searchMsPeaks(myInputDataFrame, mz.tol=0.1, mz.tol.unit='plain', ms.mode='pos')
It returns a new data frame containing the annotations.
When using a compound database, the function to call is annotateMzValues():
myCompoundConn$annotateMzValues(myInputDataFrame, mz.tol=0.1, mz.tol.unit='plain', ms.mode='neg')
It returns a new data frame containing the annotations.
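Both annotation calls above take an input data frame of M/Z values. A minimal sketch of building one is given here, assuming that the M/Z column is expected to be named `mz` (an assumption; check the help page of each function for the column names it accepts). The values are hypothetical and serve only as an illustration.

```r
# Hypothetical M/Z values, for illustration only.
# The column name `mz` is an assumption about what biodb expects by default.
myInputDataFrame <- data.frame(mz = c(80.04959, 82.04819, 83.01343))

# Inspect the structure before passing it to an annotation function.
str(myInputDataFrame)
```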
Defining a new field for a database is done in two steps, using definitions written inside a YAML file.
First we define the new field. Here we define the ChEBI database field for stars indicator (quality curation indicator):
fields:
  n_stars:
    description: The ChEBI example stars indicator.
    class: integer
Then we define the parsing expression to use in ChEBI connector in order to parse the field's value:
databases:
  chebi:
    parsing.expr:
      n_stars: //chebi:return/chebi:entityStar
We now just have to load the YAML definition file into biodb (in extension packages, this is done automatically):
mybiodb$loadDefinitions('my_definitions.yml')
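Since YAML is sensitive to indentation, it can help to check the definition file before loading it. The sketch below writes the two fragments shown above into a single file and parses it back. It assumes the yaml package is installed; the temporary file location is chosen only for illustration.

```r
# Assemble the two YAML fragments into a single definitions file.
defs <- c(
  "fields:",
  "  n_stars:",
  "    description: The ChEBI example stars indicator.",
  "    class: integer",
  "",
  "databases:",
  "  chebi:",
  "    parsing.expr:",
  "      n_stars: //chebi:return/chebi:entityStar"
)
defFile <- file.path(tempdir(), "my_definitions.yml")
writeLines(defs, defFile)

# Sanity check: parse the file and inspect the nested structure.
parsed <- yaml::yaml.load_file(defFile)
str(parsed)
```

If the parsed structure shows the expected nesting, the path in `defFile` can then be passed to loadDefinitions().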
Parsing may be more complex for some fields or databases. In that case it is possible to write specific code in the database entry class for parsing these fields.
Defining a new connector is done by writing two RC classes and a YAML definition:
- An RC class for the connector, named MyDatabaseConn.R.
- An RC class for the entry, named MyDatabaseEntry.R.
- A definition YAML file containing metadata about the new connector, like:
  - The URLs (main URL, web service base URL, etc.) for a remote database.
  - The timing for querying a remote database (maximum number of requests per second).
  - The name.
  - The parsing expressions used for parsing the entry fields.
  - The type of content retrieved from the database when downloading an entry (plain text, XML, HTML, JSON, ...).
For a good starting example of defining a new remote connector, see biodbChebi, the ChEBI extension for biodb, at https://github.com/pkrog/biodbChebi.
biodb provides a set of classes and methods to generate the skeleton of
a new repository for a new connector. The easiest way to use this feature is
through the function biodb::genNewExtPkg().
Here is a commented example that creates a new repository for a connector to
the remote Foo database:
biodb::genNewExtPkg(
path = 'the/path/to/biodbFoo', # The repository folder.
# pkgName = 'myName', # By default the last folder of `path` is used
# so you do not need to modify it.
email = 'your@e.mail', # The author's email.
dbName = 'foo.db', # The connector name that will be used by biodb.
dbTitle = 'Foo database', # A short description of the connector's database.
# pkgLicense = '...', # The generated license is always AGPL-3.
firstname = 'Your firstname',
lastname = 'Your lastname',
connType = 'compound', # Use 'mass' for an MS database or 'plain' for any
# other type. Run `biodb::getConnTypes()` to get a
# full list of all available types.
entryType = 'txt', # Other possible types are: 'plain', 'csv',
# 'html', 'json', 'list', 'sdf' and 'xml'.
# Run `biodb::getEntryTypes()` to get a full list
# of all available types.
editable = FALSE, # If the database is editable in memory.
writable = FALSE, # If the database is writable on disk (like a CSV
# file).
remote = TRUE, # If the database is accessed through a web protocol
# like HTTPS, as opposed to a local database stored
# inside an SQLite file or a CSV file.
downloadable = FALSE, # Set it to TRUE for a remote database that allows
# the download of its full content (e.g.: through
# the download of a zip file).
makefile = TRUE, # Generate a Makefile, useful for maintenance on
# UNIX/Linux systems.
rcpp = FALSE, # If set to TRUE, the package will be configured
# to use Rcpp and skeleton files will be generated
# with examples and test examples.
# vignetteName = '...', # By default the vignette name will be the package
# name.
githubRepos = 'id/repos' # The repository URL on GitHub (e.g.:
# 'pkrog/biodbChebi').
)
Once in R, you can get an introduction to the package with:
?biodb
Each class then has its own documentation. For instance, to get help about the
BiodbFactory class:
?biodb::BiodbFactory
Several vignettes are also available. To get a list of them run:
vignette(package='biodb')
To open a vignette in a browser, use its name:
vignette('new_connector', package='biodb')
If you wish to contribute to the biodb package, you first need to create an account on GitHub. You can then either ask to become a contributor or fork the project and submit a pull request.
Debugging, enhancing or creating a database connector or an entry parser is of course most welcome.