README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```

# plyranges: fluent genomic data analysis <img id="plyranges_logo" src="man/figures/logo.png" align="right" width = "125" />

<!-- badges: start -->
[![R-CMD-check-bioc](https://github.com/tidyomics/plyranges/workflows/R-CMD-check-bioc/badge.svg)](https://github.com/tidyomics/plyranges/actions?query=workflow%3AR-CMD-check-bioc)
[![BioC status](http://www.bioconductor.org/shields/build/release/bioc/plyranges.svg)](https://bioconductor.org/checkResults/release/bioc-LATEST/plyranges)
<!-- badges: end -->

[plyranges](https://www.bioconductor.org/packages/release/bioc/html/plyranges.html) provides a consistent interface for importing and wrangling
genomics data from a variety of sources. The package defines a grammar of
genomic data transformation based on `dplyr` and the Bioconductor packages
`IRanges`, `GenomicRanges`, and `rtracklayer`. It does this by providing a set
of verbs for developing analysis pipelines based on _Ranges_ objects that
represent genomic regions:

* Modify genomic regions with the `mutate()` and `stretch()` functions.
* Modify genomic regions while fixing the start/end/center coordinates with the `anchor_` family of functions.
* Sort genomic ranges with `arrange()`.
* Modify, subset, and aggregate genomic data with the `mutate()`,
`filter()`, and `summarise()`functions.
* Any of the above operations can be performed on partitions of the
data with `group_by()`.
* Find nearest neighbour genomic regions with the `join_nearest_` family
of functions.
* Find overlaps between ranges with the `join_overlaps_` family of functions.
* Merge all overlapping and adjacent genomic regions with `reduce_ranges()`.
* Merge the end points of all genomic regions with `disjoin_ranges()`.
* Import and write common genomic data formats with the `read_/write_` family
of functions.

For more details on the features of plyranges, read the
[vignette](https://tidyomics.github.io/plyranges/articles/an-introduction.html).
For a complete case-study on using plyranges to combine ATAC-seq and RNA-seq
results read the [*fluentGenomics*
workflow](https://tidyomics.github.io/fluentGenomics).

plyranges is part of the [tidyomics](https://github.com/tidyomics)
project, providing a `dplyr`-based interface for many types of
genomics datasets represented in Bioconductor.

# Installation

[plyranges](https://www.bioconductor.org/packages/release/bioc/html/plyranges.html) can be installed from the latest Bioconductor
release:

```{r, eval=FALSE}
# install.packages("BiocManager")
BiocManager::install("plyranges")
```

To install the development version from GitHub:

```{r, eval=FALSE}
BiocManager::install("tidyomics/plyranges")
```

# Quick overview

## About `Ranges`

`Ranges` objects can either represent sets of integers as `IRanges` (which have
start, end and width attributes) or represent genomic intervals (which have
additional attributes, sequence name, and strand) as `GRanges`.  In addition,
both types of `Ranges` can store information about their intervals as metadata
columns (for example GC content over a genomic interval).

`Ranges` objects follow the tidy data principle: each row of a `Ranges` object
corresponds to an interval, while each column will represent a variable about
that interval, and generally each object will represent a single unit of
observation (like gene annotations).

We can construct a `IRanges` object from a `data.frame` with a `start` or
`width` using the `as_iranges()` method.

```{r, message=FALSE}
library(plyranges)
df <- data.frame(start = 1:5, width = 5)
as_iranges(df)
# alternatively with end
df <- data.frame(start = 1:5, end = 5:9)
as_iranges(df)
```

We can also construct a `GRanges` object in a similar manner. Note that a
`GRanges` object requires at least a seqnames column to be present in the
data.frame (but not necessarily a strand column).

```{r}
df <- data.frame(seqnames = c("chr1", "chr2", "chr2", "chr1", "chr2"),
                 start = 1:5,
                 width = 5)
as_granges(df)
# strand can be specified with `+`, `*` (mising) and `-`
df$strand <- c("+", "+", "-", "-", "*")
as_granges(df)
```

# Example: finding GWAS hits that overlap known exons
Let's look at a more a realistic example (taken from HelloRanges vignette).

```{r, include=FALSE}
dir <- system.file(package = "HelloRangesData", "extdata/")
genome <- as_granges(read.delim(file.path(dir, "hg19.genome"),
                     header = FALSE),
                     seqnames = V1, start = 1L, width = V2)

gwas <- read_bed(file.path(dir, "gwas.bed"), genome_info = genome)
exons <- read_bed(file.path(dir, "exons.bed"), genome_info = genome)
```

Suppose we have two _GRanges_ objects: one containing coordinates of known
exons and another containing SNPs from a GWAS.

The first and last 5 exons are printed below, there are two additional columns
corresponding to the exon name, and a score.

We could check the number of exons per chromosome using `group_by` and
`summarise`.
```{r}
exons
exons %>%
  group_by(seqnames) %>%
  summarise(n = n())
```

Next we create a column representing the transcript_id with `mutate`:

```{r}
exons <- exons %>%
  mutate(tx_id = sub("_exon.*", "", name))
```

To find all GWAS SNPs that overlap exons, we use `join_overlap_inner`. This
will create a new _GRanges_ with the coordinates of SNPs that overlap exons, as
well as metadata from both objects.

```{r}
olap <- join_overlap_inner(gwas, exons)
olap
```

For each SNP we can count the number of times it overlaps a transcript.

```{r}
olap %>%
  group_by(name.x, tx_id) %>%
  summarise(n = n())
```

We can also generate 2bp splice sites on either side of the exon using
`flank_left` and `flank_right`. We add a column indicating the side of flanking
for illustrative purposes. The `interweave` function pairs the left and right
ranges objects.

```{r}
left_ss <- flank_left(exons, 2L)
right_ss <- flank_right(exons, 2L)
all_ss <- interweave(left_ss, right_ss, .id = "side")
all_ss
```

# Learning more

- The [*fluentGenomics* workflow](https://sa-lee.github.io/fluentGenomics) package shows you how to combine differential expression genes and differential chromatin accessibility peaks using plyranges. It extends the [case study](https://github.com/mikelove/plyrangesTximetaCaseStudy) by Michael Love for using plyranges with [tximeta](https://bioconductor.org/packages/release/bioc/html/tximeta.html).

- The [extended vignette in the plyrangesWorkshops package](https://github.com/sa-lee/plyrangesWorkshops) has a detailed
walk through of using plyranges for coverage analysis.

- The [Bioc 2018 Workshop book](https://bioconductor.github.io/BiocWorkshops/fluent-genomic-data-analysis-with-plyranges.html) has worked examples of using `plyranges` to analyse publicly available genomics data.


# Citation

If you found `plyranges` useful for your work please cite our
[paper](http://dx.doi.org/10.1186/s13059-018-1597-8):

```
@ARTICLE{Lee2019,
  title    = "plyranges: a grammar of genomic data transformation",
  author   = "Lee, Stuart and Cook, Dianne and Lawrence, Michael",
  journal  = "Genome Biol.",
  volume   =  20,
  number   =  1,
  pages    = "4",
  month    =  jan,
  year     =  2019,
  url      = "http://dx.doi.org/10.1186/s13059-018-1597-8",
  doi      = "10.1186/s13059-018-1597-8",
  pmc      = "PMC6320618"
}
```

# Contributing

We welcome contributions from the R/Bioconductor community. We ask that
contributors follow the [code of conduct](.github/CODE_OF_CONDUCT.md) and the guide
outlined [here](.github/CONTRIBUTING.md).