API Design
The overarching goal is to distill the IRanges API into something more user-friendly.
Specific goals:
- API entry points should be chainable, endomorphic verbs,
- We intentionally avoid exposing accessors and intermediate data structures, although the implementation will be based on them,
- We will not use any S4 generics, instead relying on dispatch by the low-level API. For meaningful error messages, we will probably need some way to assert that an object satisfies the implicit contract of the function.
For compatibility, we will use the existing data structures from the IRanges family of packages. The API will treat these as tables, despite the Bioc API regarding them as vectors.
For the sake of simplicity, some data structures will not be handled by the API (explicitly; some might end up working through duck typing). The primary supported data structures include:
- Ranges
- GRanges
- SummarizedExperiment (SE)
These notes apply primarily to the Ranges API, but at least some of the operations will also be valid for GRanges and even SE.
GRanges obviously introduces the notions of sequence and strand. Could we generalize this?
With RangedData, we tried to represent sequence as the “space”. We could also think of the sequence as a default grouping variable, and the user will have the option to group by an arbitrary column.
Strandedness seems to have an analog with directedness in graphs. We define a “directed” range as one that has an orientation, proceeding either from start or the end. Of course, “start” and “end” connote a direction, at least a meaningful default for “undirected” ranges.
GRangesLists typically arise with gene structures and read alignments. Gene structures typically come from a TxDb, and we can make the selection and shape of the data explicit:
txdb %>% exons() %>% group_by(tx_id)
We could represent GRangesList using a grouped GRanges, but it is not clear if that is semantically appropriate, because the groups are being treated as records/rows in the dataset, rather than as a specification for aggregation. We also want to preserve the GRangesList if possible, since many methods have been written for it. Thus, the API should try to treat a GRangesList as a nested type of GRanges, i.e., as a table, not a list. We could introduce the notion of “wound” ranges, where `unwind()` would flatten a GRangesList to a GRanges, and `wind()` would wind a GRanges into a GRangesList, e.g.
wind(exons, tx_id)
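A minimal sketch of how these verbs could sit on top of the existing GRangesList machinery; the implementations via `unlist()`/`split()` and the `group` column are assumptions, and non-standard evaluation of the grouping column is glossed over here:

```r
library(GenomicRanges)

# Hypothetical unwind(): flatten a GRangesList, remembering the originating element.
unwind <- function(grl) {
  gr <- unlist(grl, use.names = FALSE)
  gr$group <- rep(seq_along(grl), lengths(grl))
  gr
}

# Hypothetical wind(): regroup a GRanges into a GRangesList by a metadata column.
wind <- function(gr, var) {
  split(gr, mcols(gr)[[var]])
}

# e.g., wind(exons, "tx_id") corresponds to wind(exons, tx_id) above
```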
For read alignments, the data currently arrive as a GAlignments object. The user should never see this. But it is not clear how to represent the alignment structure as a GRanges. The user might expect one row per alignment (with a range spanning it), and there could be a `split_alignments()` that breaks up the ranges by the CIGAR N codes and groups by the original alignments automatically:
bam %>% split_alignments()
And maybe a `split_alignments_by_deletions()` to exclude the D regions, but that is uncommon.
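A rough sketch of what `split_alignments()` could look like on top of GenomicAlignments, whose `grglist()` already splits each alignment at N gaps in the CIGAR; the wrapper name and the `alignment` grouping column are assumptions:

```r
library(GenomicAlignments)

split_alignments <- function(alns) {
  grl <- grglist(alns)                                # one GRanges per alignment, split at N
  gr <- unlist(grl, use.names = FALSE)
  gr$alignment <- rep(seq_along(grl), lengths(grl))   # group the pieces by original alignment
  gr
}

# usage sketch: readGAlignments("reads.bam") %>% split_alignments()
```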
We can construct ranges from their basic components (start, end, width). We could guess the variable names if a tabular object is passed to the constructor, just as `GRanges()` works now:
df %>% Ranges
Maybe extra parameters could specify atypical mappings, like:
df %>% Ranges(start=st, width=wdth)
This matches up well with existing tabular constructors like `data.frame()` and `data_frame()`.
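A minimal sketch of such a constructor built on the existing `IRanges()`, guessing the usual column names from a data frame; the `Ranges()` wrapper, its guessing heuristic, and the use of strings instead of the non-standard evaluation shown above are assumptions:

```r
library(IRanges)
library(magrittr)

# Hypothetical constructor: prefer start/end, fall back to start/width.
Ranges <- function(df, start = "start", end = "end", width = "width") {
  if (all(c(start, end) %in% names(df))) {
    IRanges(start = df[[start]], end = df[[end]])
  } else {
    IRanges(start = df[[start]], width = df[[width]])
  }
}

df <- data.frame(start = c(1L, 10L), end = c(5L, 20L))
df %>% Ranges()
```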
IRanges has already explored this space and defined its own “algebra”, but we could revisit it. bedtools is also inspiring.
We need to make adjustments to the start, end and width. Extending dplyr, this would just be `mutate()`, e.g.:
mutate(x, start=start+1L)
And while that could be supported, we typically want to make relative adjustments.
One complication is that the start, end and width are mutually dependent. Changing the start, as above, needs to change either the width or the end to preserve integrity. We assume that the user thinks of the range as a pair of endpoints (this is natural in genomics anyway), so setting the start will modify the width but preserve the end. The opposite, preserving the width but not the other endpoint, is `IRanges::shift()`. Maybe we should be more explicit about whether the width or the other endpoint is changed by side effect.
Here are some possible verbs for changing the width:
- `stretch_start`/`stretch_end` - extend the range by ‘x’ by pulling the start or the end (and leaving the other one anchored),
- `squeeze_start`/`squeeze_end` - opposite of stretch (could just use a negative value for stretch, but being explicit seems preferable in many cases).

Not sure whether `squeeze` is actually useful, as we typically want to stretch a range to consider context, not squeeze it.
Or maybe it would be more obvious if we supported the notion of anchoring, similar to grouping. This would require decorating the object. For example,
stretch(anchor_end(x), y)
This pushes information from the syntax to the object state. While it simplifies the grammar, object behavior is more complex and implicit. But if the code is part of the same chain, it seems intuitive.
There would be three ways to anchor:
- `anchor_start()`
- `anchor_end()`
- `anchor_center()`

Anchoring the midpoint would enable symmetric zooming.
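A minimal sketch of the decoration idea, assuming the anchor is recorded in the object's `metadata()` and consulted by `stretch()`; the storage mechanism and the mapping onto `resize()` are assumptions:

```r
library(IRanges)

# Hypothetical decorators: record the anchor in the object's metadata().
anchor_start  <- function(x) { metadata(x)$anchor <- "start";  x }
anchor_end    <- function(x) { metadata(x)$anchor <- "end";    x }
anchor_center <- function(x) { metadata(x)$anchor <- "center"; x }

# Hypothetical stretch(): widen by 'y', keeping the anchored point fixed.
stretch <- function(x, y) {
  anchor <- metadata(x)$anchor
  if (is.null(anchor)) anchor <- "center"
  switch(anchor,
         start  = resize(x, width(x) + y, fix = "start"),        # end moves
         end    = resize(x, width(x) + y, fix = "end"),          # start moves
         center = resize(x, width(x) + 2L * y, fix = "center"))  # symmetric zoom
}

# usage sketch: stretch(anchor_end(x), 10L)
```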
This is basically a deconstruction of `IRanges::resize()`, except `resize()` is absolute. Perhaps `set_width()` could play the role of `resize()`. It is not clear whether we need `set_start()` and `set_end()`.
When strand is involved, we will need something like:
- `anchor_3p()` (start for negative strand)
- `anchor_5p()` (end for negative strand)
Anchoring the width would not make sense, because (1) it is not a point that can be intuitively anchored and (2) stretching and squeezing obviously modify the width, so it would not apply to those.
For width-invariant changes, there is `shift()` from the IRanges API. That is a symmetric operation (positive for right, negative for left). We could introduce `shift_left()` and `shift_right()` if we wanted the direction to be explicit.
Note that with GRanges, strand can modulate direction, and we need to distinguish left/right (chromosomal coordinates) from the direction of transcription:
- `shift_upstream()`
- `shift_downstream()`
| Fluent | IRanges | bedtools |
|---|---|---|
| `stretch(anchor_start())` | - | `slop -r` |
| `stretch(anchor_end())` | - | `slop -l` |
| `stretch(anchor_center())` | `+()` | `slop -b` |
| `set_width(anchor_*)` | `resize(fix=*)` | - |
| `shift_right()` | `shift()` | `shift -s` |
| `shift_left()` | - | - |
| `shift_up/downstream()` | - | - |
Note that `resize,GRanges()` and `slop -s` use the strand for direction, i.e., `anchor_3p()` and `anchor_5p()`. Bedtools supports width-relative adjustment with `slop -pct`; we will let the user be explicit, e.g.: `stretch(x, 2*width(x))`.
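A minimal sketch of the strand-aware shifts, assuming downstream means higher coordinates on the `+` (and `*`) strand and lower coordinates on `-`; the names follow the proposal above:

```r
library(GenomicRanges)

shift_downstream <- function(x, by) {
  neg <- as.logical(strand(x) == "-")
  shift(x, ifelse(neg, -by, by))   # downstream is to the left on the minus strand
}

shift_upstream <- function(x, by) shift_downstream(x, -by)
```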
One problem is that ranges can be stretched or shifted out of bounds, or squeezed out of existence.
Currently, GenomicRanges will warn the user when a range is pushed out of bounds and suggest calling `trim()`. We should probably throw an error in this case, because it is never a good idea for an object to be in a (semi-)invalid state. The resolution needs to be explicit, as part of the function call. By default, we throw an error. There could be variants of all of these operations with the suffix `_up_to` that imply the trim, e.g., `stretch_up_to()` or `set_width_up_to()`. Another resolution would be to drop the ranges that would become out of bounds. Those could use the `_or_drop()` suffix.
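A sketch of how the default error and the `_up_to` variant could relate, assuming an internal check against `seqlengths()`; the helper and both verbs here are hypothetical:

```r
library(GenomicRanges)

# Hypothetical bounds check: error instead of GenomicRanges' out-of-bounds warning.
check_in_bounds <- function(x) {
  sl <- seqlengths(x)[as.character(seqnames(x))]
  if (any(start(x) < 1L) || any(!is.na(sl) & end(x) > sl))
    stop("some ranges are out of bounds; use a *_up_to() or *_or_drop() variant")
  x
}

# default: error on out-of-bounds; the _up_to variant implies the trim
set_width       <- function(x, width) check_in_bounds(resize(x, width))
set_width_up_to <- function(x, width) trim(resize(x, width))
```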
We could also use this anchoring notion to define a new verb `flip()` that generates flanking regions, but that will cause semantic issues with strandedness. Instead, we probably need `flank_left()` and `flank_right()`. With strand, there would be `flank_before()` and `flank_after()`.
Binary operations between two ranges mostly align with set operations that treat ranges as sets of integers. These are parallel/vectorized operations between two vectors/tables of ranges, so are not to be confused with the aggregating set operations described later.
Currently, these set operations have the `p` prefix, e.g., `punion()`. We could make the binary nature more obvious by aliasing them as operators:
- `%union%`
- `%intersect%`
- `%setdiff%`
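These aliases would be thin wrappers over the existing parallel set operations (a sketch; only the operator spellings are new):

```r
library(IRanges)

`%union%`     <- function(x, y) punion(x, y)
`%intersect%` <- function(x, y) pintersect(x, y)
`%setdiff%`   <- function(x, y) psetdiff(x, y)

# e.g., IRanges(1, 10) %intersect% IRanges(5, 20)  ->  IRanges(5, 10)
```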
There are also operations that relate to the relative positions of two ranges:
- `between()`: the range in the gap between the two ranges
  - To get the separation, just do `width(between(x, y))`
- `span()`: a range that spans both ranges (and the gap)
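Both already have close parallel analogs in IRanges, so a sketch could be (the verb names are the proposal; mapping them onto `pgap()` and `punion(fill.gap = TRUE)` is an assumption):

```r
library(IRanges)

between <- function(x, y) pgap(x, y)                     # the gap between corresponding ranges
span    <- function(x, y) punion(x, y, fill.gap = TRUE)  # smallest range covering both

# separation between corresponding ranges: width(between(x, y))
```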
As a note, to express a binary operation involving the subject of a `mutate()`, `summarize()` or whatever call, we could use the pronoun:
mutate(x, foo = width(. %intersect% other_ranges))
Allen’s interval algebra defines different relationships between ranges. We could map those to special `%x%` operators:
| Example | Operator | Inverse |
|---|---|---|
| "xxxx     " | %upstream_of%, %left_of% | |
| "     xxxx" | %downstream_of%, %right_of% | |
| "xxxx     " | %flanks_upstream%, %flanks_left% | |
| "    xxxxx" | %flanks_downstream%, %flanks_right% | |
| "xxxx     " | %over% | %outside% |
| "  xxxxxx " | | |
| "xxxx     " | %begins%, %starts% | |
| "xxxxxx   " | %begun_by%, %started_by% | |
| "  xxx    " | %within% | |
| " xxxxxx  " | %includes% | |
| "    xxxxx" | %finishes%, %ends% | |
| "  xxxxxxx" | %finished_by%, %ended_by% | |
| "xxxxxxxxx" | == | %!=% |
| "xxxxxxxxx" | | |
Of those, only `%over%` and `==` (and their inverses) are currently supported. Not clear how many of these are actually useful in genomics, but these are easy to implement and describe.
For some relationships, there are multiple operators, which behave differently with respect to strand/direction.
The “before” and “after” relationships have only proven useful for some k, usually k=1. Those correspond to `precede()` and `follow()` in the current framework.
We can keep or discard ranges based on range-specific comparisons. At the low level, one could filter as a table. For example, to filter to the ranges that overlap some other ranges:
filter(x, ranges %over% other_ranges)
But we typically want to abstract away references to `ranges`. This means high-level functions, including:
- `filter_by_overlaps()` as an analog of `subsetByOverlaps()`
  - How to invert this (equivalent of `%outside%`)? `filter_by_non_overlaps()`? (sketched below)
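A minimal sketch of both filters, built on the existing `overlapsAny()`; the names follow the proposal and the one-line implementations are assumptions:

```r
library(GenomicRanges)

filter_by_overlaps     <- function(x, y) x[overlapsAny(x, y)]    # keep ranges hitting y
filter_by_non_overlaps <- function(x, y) x[!overlapsAny(x, y)]   # the %outside% analog
```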
General aggregation will operate on a preceding `group_by()`, so objects will need to keep track of their grouping.
We often want to aggregate based on a matching/overlap operation, e.g., summarizing ranges that fall into some windows, or breaking ties in nearest hits. Ideally, some sort of grouping could be carried over from a previous overlap join, i.e., there is some variable that groups the joined ranges by the original query (or subject) range. This corresponds to the bedtools `map` tool. See the Merging section for more details.
Other range aggregations include:
- `range` - min start and max end
  - a little confusing due to multiple uses of the word “range”; maybe `compute_extents`?
- `coverage` - counting overlapping ranges per position
  - the range analog of `table()`
  - needs to be an endomorphism, not generate an RleList
  - the output should be a GPos by default, as coverage is typically seen as a per-position summary
  - `compute_runs` could convert to a run-length encoded GRanges (see the sketch after this list)
- `covered_by_any()` and `covered_by_all()` may be useful conveniences based on the coverage and the equivalence of ranges and logical vectors (`TRUE` inside, `FALSE` outside); see `coerce,Rle,Ranges()`.
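A sketch of the run-length encoded form mentioned in the list above, using the existing `coverage()` and `bindAsGRanges()`; the `compute_runs` name is the proposal, and the per-position GPos form would be a further coercion:

```r
library(GenomicRanges)

# Hypothetical compute_runs(): coverage as a run-length encoded GRanges,
# one row per run of constant depth, with the depth in a 'score' column.
compute_runs <- function(x) {
  bindAsGRanges(score = coverage(x))
}
```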
There is also a group of set operations that view the range as a set of integers, counting from the start through the end. IRanges currently implements `intersect()`, `union()`, and `setdiff()`. There is some confusion though, because `%in%` and `match()` operate at the range level and thus do not follow compatible semantics. The ranges equivalent of `unique()` is `reduce()`, and `disjoin()` is in the same family. There are also `p` variants of these functions that operate in parallel down two objects. While these do aggregate ranges (if we think of them as sets of integers), they operate within individual rows, so they are not aggregations in the present sense.

Some more notes on the special operations:
- `reduce` - combine overlapping and adjacent ranges
  - in a way this is the range analog of `unique()`
  - bedtools calls it `merge`
- `disjoin`
  - at the set level equivalent to `reduce`, but really the range analog of a union of the end points
The range aggregations themselves imply a grouping, which we want to use for aggregating other variables in the dataset. Some of the operations currently return a “revmap” list to do this.
We could express the grouping verbs more explicitly, e.g.:
- `reduce()` maps to `group_by_any_overlap()`
- `disjoin()` maps to `group_by_all_overlap()` (many-to-many)
So one could imagine an API where a `reduce()` is specified by:
r %>% group_by_any_overlap() %>% summarize(foo.mean=mean(foo))
where the grouping result is a deferred `reduce()`, finally executed by the `summarize()` call. The problem is that one could interpret the semantics of `summarize()` to be “apply these operations to each group, generating a list of tables, and stack them”, i.e., there is an implicit desire to perform a `reduce()` that should be made more explicit. This argues for coalescing the operations into:
r %>% reduce(foo.mean=mean(foo))
In the context of general tabular manipulations, it might make sense to explicitly qualify these summaries as range operations. This must be done consistently. I.e.,
intersect_ranges()
union_ranges()
setdiff_ranges()
reduce_ranges()
disjoin_ranges()
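A rough sketch of how `reduce_ranges()` with summary arguments could be built on the existing `reduce(with.revmap = TRUE)`; it takes plain functions over the metadata columns rather than the non-standard evaluation implied above, so treat it as an assumption about the mechanics, not the final interface:

```r
library(GenomicRanges)

reduce_ranges <- function(x, ...) {
  fns <- list(...)                          # named summary functions over the grouped mcols
  red <- reduce(x, with.revmap = TRUE)      # merged ranges plus indices of contributing rows
  for (nm in names(fns)) {
    mcols(red)[[nm]] <- vapply(red$revmap,
                               function(i) fns[[nm]](mcols(x)[i, , drop = FALSE]),
                               numeric(1))
  }
  red$revmap <- NULL
  red
}

# usage sketch: reduce_ranges(r, foo.mean = function(df) mean(df$foo))
```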
It would be nice to have a `complement_ranges()` that is like `gaps()` but does not treat strand as a “space”, which has the frustrating effect of including the entire sequence in gaps on the other strands. The name `gaps()` is also confusing, since it implies the result will only contain ranges between a pair of input ranges (the flanking regions are also included).
There are several common types of joins:
- Mutate (combination) join (`merge()`)
- Filter join (filter with `%in%`)
- Set operations (`intersect()`, etc.)
A join consists of two steps:
- Matching to generate a bipartite graph, and a corresponding table where each row represents an edge (or, optionally, a node for singletons) in that graph,
- Optionally applying some aggregation on the graph/table, like
  - Taking the left nodes with at least one edge for filtering,
  - Taking all unique left nodes and singleton right nodes for union.
The types of matching correspond to the comparison operations, described above. Conventional joins tend to match on identity, so the key columns are folded together. Would we carry over the other ranges, or drop them like an identity join? Probably carry them over.
Matching operations yield, internally, a Hits object, which we have recently started modeling as a graph. The canonical coercion to a Grouping is by the query (left) node, which is useful, because we are typically operating on the ranges of interest and want to summarize another set with respect to them. We can describe this as a new type of join: a grouping join. It defers the aggregation and just groups by the graph in some way, typically by left node. That operation would be `group_by_left_join()`. This is not perfect though, because the function does a lot more than group the rows.
Sometimes we want to group by the connected components of the graph, so there should be a convenient way to do that.
The most obvious and useful matching is the overlap, and `findOverlaps()` has supported that for a long time, along with restrictions for “start”, “end”, “within” and “equal”. We want to recast `findOverlaps()` and similar matching functions as join operations, maybe with the term `ojoin` instead of `join`. Would we want `join*` variants for all operations, like `flanking_upstream_join()` and `finishing_join()`? Really, only “any” overlap and “within” overlap, along with nearest neighbors, have proven useful in the current framework.
A very common aggregation on overlaps is counting how many subject ranges overlap each query, i.e., `countOverlaps()`. This could be done using dplyr:
left_join(x,
          group_by_left_ojoin(., y) %>% summarize(key = unique(key), count = n()),
          "key")
Reading that,
mutate(x, count = count_overlaps(., y))
seems a lot easier to understand. And it also suggests that we do not need to use join terminology at all, even though the operations could be classified as joins. In other words, we could keep `find_overlaps()` as a left ojoin, and `group_by_overlaps()` as a join that groups the matches by query. It does not solve the problem, however, of the complexity of implementing something like `count_overlaps()`. We could use a hybrid of dplyr and base R:
mutate(x, count = lengths(group_by_overlaps(., y)))
Or maybe a bit more explicit,
mutate(x, count = group_nrows(group_by_overlaps(., y)))
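The simplest escape hatch is to define `count_overlaps()` directly on the existing `countOverlaps()` (a sketch; only the snake_case wrapper is new):

```r
library(GenomicRanges)

count_overlaps <- function(x, y, ...) countOverlaps(x, y, ...)

# so the earlier example becomes: mutate(x, count = count_overlaps(., y))
```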
In the case of GRanges, `findOverlaps()` has always required identical strand. This often surprises users, including experts. Instead, the requirement of identical strand should be expressed as a `filter()` on the `find_overlaps()` result.
When the input to `find_overlaps()` is already grouped, overlap is at the group level, i.e., if any two ranges in different groups overlap, those groups are considered overlapping. This supports GRangesList-style overlap operations.
For nearest neighbors, there are `nearest()`, `precede()`, and `follow()`. We should give these more consistent names, like:
- `join_nearest()` - nearest on either side
- `join_nearest_left/right()` - nearest on the left/right (strand ignored)
- `join_nearest_up/downstream()` - nearest on one side, considering strand
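A minimal sketch of `join_nearest()` on top of the existing `nearest()`, carrying the subject's metadata columns over to the query; dropping queries with no neighbor is an assumption about the join semantics:

```r
library(GenomicRanges)

join_nearest <- function(x, y) {
  idx <- nearest(x, y)                      # index of the nearest subject, NA if none
  hit <- !is.na(idx)
  out <- x[hit]
  mcols(out) <- cbind(mcols(out), mcols(y)[idx[hit], , drop = FALSE])
  out
}
```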
Typical post-filters will include exclusion by distance, exclusion of overlapping ranges, etc. Tie breaking is an example of a post-aggregation.
Things to order by (in some order of precedence):
- Start, end, width, midpoint?
While `sort()` might follow IRanges (start, end), `arrange()` or `sort(by = )` would make this explicit.
For GRanges, seqname and strand are also useful.
This is a place where generic tables and annotated range tables diverge. Each row of our table has a primary value, the range, on which we can sort. Do we force the user to spell out the range-sorting, e.g.:
r %>% arrange(start, end, score)
Or do we provide a convenience:
r %>% arrange_ranges(score)
For some reason `arrange_ranges()` does not seem obvious enough. Probably should keep it explicit.
The canonical way to import annotated genomic ranges is `rtracklayer::import()`. While `import()` is a useful abstraction, it may be too abstract. There is benefit to enabling the user to express the expected return value and/or the format of the file being parsed. Since we are more or less stipulating that every object is a GRanges, there is no need to be explicit about the type of the return value. Instead, we could follow the pattern of `read.csv()` and `read_csv()` and have functions named according to `read_[format]()`, e.g., `read_bed()`, `read_bam()`, etc.
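For formats rtracklayer already handles, these readers could be thin wrappers that fix the format (a sketch; the snake_case names follow the proposal):

```r
library(rtracklayer)

read_bed <- function(file, ...) import(file, format = "bed", ...)
read_gff <- function(file, ...) import(file, format = "gff", ...)

# usage sketch: read_bed("peaks.bed")
```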
For BAM files specifically, the user typically wants the alignments, but sometimes the read sequences are also useful. Should that be a separate function (`read_bam_with_sequences()`), or a parameter?