-
Notifications
You must be signed in to change notification settings - Fork 193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make counting more robust to input datatype #722
Changes from 43 commits
3905f6e
8eb5855
84ef887
d82295d
a4505b2
c334d4d
94b3e6c
87074b2
aa21bc9
309586f
3b5e01d
3384f45
6b81b51
18b70b5
7aa2d79
93caf72
62c5967
1238a2d
c1c09d1
bc20432
cbfd92e
60c0aad
c72b1aa
e4ca264
00fe408
a470531
aca38a3
168b3ab
204f89a
3f3e657
657c1ec
2ebf1f3
f469f0c
6cf64d9
dc1f7ae
4fd45d9
d31dcce
0fd1c27
fbdf564
e1fab99
717c795
ba59cc3
2dfba9f
8f446ee
e20e28c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,24 +16,24 @@ end | |
|
||
#### functions for counting a single list of integers (1D) | ||
""" | ||
addcounts!(r, x, levels::UnitRange{<:Int}, [wv::AbstractWeights]) | ||
addcounts!(r, x, levels::UnitRange{<:Integer}, [wv::AbstractWeights]) | ||
|
||
Add the number of occurrences in `x` of each value in `levels` to an existing | ||
array `r`. If a weighting vector `wv` is specified, the sum of weights is used | ||
rather than the raw counts. | ||
array `r`. For each `xi ∈ x`, if `xi == levels[j]`, then we increment `r[j]`. | ||
|
||
If a weighting vector `wv` is specified, the sum of weights is used rather than the | ||
raw counts. | ||
""" | ||
function addcounts!(r::AbstractArray, x::IntegerArray, levels::IntUnitRange) | ||
# add counts of integers from x to r | ||
# add counts of integers from x that fall within levels to r | ||
|
||
k = length(levels) | ||
length(r) == k || throw(DimensionMismatch()) | ||
checkbounds(r, axes(levels)...) | ||
|
||
m0 = levels[1] | ||
m1 = levels[end] | ||
b = m0 - 1 | ||
m0 = first(levels) | ||
m1 = last(levels) | ||
b = m0 - firstindex(levels) # firstindex(levels) == 1 because levels::IntUnitRange | ||
|
||
@inbounds for i in 1 : length(x) | ||
xi = x[i] | ||
@inbounds for xi in x | ||
if m0 <= xi <= m1 | ||
r[xi - b] += 1 | ||
end | ||
|
@@ -42,15 +42,21 @@ function addcounts!(r::AbstractArray, x::IntegerArray, levels::IntUnitRange) | |
end | ||
|
||
function addcounts!(r::AbstractArray, x::IntegerArray, levels::IntUnitRange, wv::AbstractWeights) | ||
k = length(levels) | ||
length(r) == k || throw(DimensionMismatch()) | ||
# add wv weighted counts of integers from x that fall within levels to r | ||
|
||
length(x) == length(wv) || | ||
throw(DimensionMismatch("x and wv must have the same length, got $(length(x)) and $(length(wv))")) | ||
|
||
m0 = levels[1] | ||
m1 = levels[end] | ||
xv = vec(x) # discard shape because weights() discards shape | ||
|
||
checkbounds(r, axes(levels)...) | ||
LilithHafner marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
m0 = first(levels) | ||
m1 = last(levels) | ||
b = m0 - 1 | ||
|
||
@inbounds for i in 1 : length(x) | ||
xi = x[i] | ||
@inbounds for i in eachindex(xv, wv) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this will fail if
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Previously, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK - I thought you wanted to fix OffsetArrays.jl issues in this PR, but agreed it can go to a separate PR since what you propose now is safe. However note that in:
you already decide on the interface you mention - choosing to support "position in sequence" interpretation (if There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That choice was forced to avoid a regression on It is also an implicit conversion to linear indexing, which is somewhat different than an implicit alignment of mismatched indices.
I don't see this as an issue: julia> w = Weights([4,2,6]);
julia> ow = Weights(OffsetVector([4,2,6], 2));
julia> x = [3,1,2];
julia> ox = OffsetVector([3,1,2], 2);
julia> addcounts!(zeros(Int, 3), x, 1:3, w)
3-element Vector{Int64}:
2
6
4
julia> addcounts!(zeros(Int, 3), ox, 1:3, ow)
3-element Vector{Int64}:
2
6
4
julia> addcounts!(zeros(Int, 3), ox, 1:3, w)
ERROR: DimensionMismatch: all inputs to eachindex must have the same axes, got (OffsetArrays.IdOffsetRange(values=3:5, indices=3:5),) and (Base.OneTo(3),)
Stacktrace:
[1] throw_eachindex_mismatch_indices(::IndexCartesian, ::Tuple{OffsetArrays.IdOffsetRange{Int64, Base.OneTo{Int64}}}, ::Vararg{Any})
@ Base ./abstractarray.jl:292
[2] eachindex
@ ./multidimensional.jl:388 [inlined]
[3] eachindex
@ ./abstractarray.jl:334 [inlined]
[4] addcounts!(r::Vector{Int64}, x::OffsetVector{Int64, Vector{Int64}}, levels::UnitRange{Int64}, wv::Weights{Int64, Int64, Vector{Int64}})
@ StatsBase ~/.julia/dev/StatsBase/src/counts.jl:58
[5] top-level scope
@ REPL[24]:1
julia> addcounts!(zeros(Int, 3), x, 1:3, ow)
ERROR: DimensionMismatch: all inputs to eachindex must have the same axes, got (Base.OneTo(3),) and (OffsetArrays.IdOffsetRange(values=3:5, indices=3:5),)
Stacktrace:
[1] throw_eachindex_mismatch_indices(::IndexCartesian, ::Tuple{Base.OneTo{Int64}}, ::Vararg{Any})
@ Base ./abstractarray.jl:292
[2] eachindex
@ ./multidimensional.jl:388 [inlined]
[3] eachindex
@ ./abstractarray.jl:334 [inlined]
[4] addcounts!(r::Vector{Int64}, x::Vector{Int64}, levels::UnitRange{Int64}, wv::Weights{Int64, Int64, OffsetVector{Int64, Vector{Int64}}})
@ StatsBase ~/.julia/dev/StatsBase/src/counts.jl:58
[5] top-level scope
@ REPL[25]:1
julia>
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To allow different index styles (but same length) we could do:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
xi = xv[i] | ||
if m0 <= xi <= m1 | ||
r[xi - b] += wv[i] | ||
end | ||
|
@@ -69,8 +75,8 @@ falling in that range will be considered (the others will be ignored without | |
raising an error or a warning). If an integer `k` is provided, only values in the | ||
range `1:k` will be considered. | ||
|
||
If a weighting vector `wv` is specified, the sum of the weights is used rather than the | ||
raw counts. | ||
If a vector of weights `wv` is provided, the proportion of weights is computed rather | ||
than the proportion of raw counts. | ||
|
||
The output is a vector of length `length(levels)`. | ||
""" | ||
|
@@ -90,8 +96,10 @@ counts(x::IntegerArray, wv::AbstractWeights) = counts(x, span(x), wv) | |
proportions(x, levels=span(x), [wv::AbstractWeights]) | ||
|
||
Return the proportion of values in the range `levels` that occur in `x`. | ||
Equivalent to `counts(x, levels) / length(x)`. If a weighting vector `wv` | ||
is specified, the sum of the weights is used rather than the raw counts. | ||
Equivalent to `counts(x, levels) / length(x)`. | ||
|
||
If a vector of weights `wv` is provided, the proportion of weights is computed rather | ||
than the proportion of raw counts. | ||
""" | ||
proportions(x::IntegerArray, levels::IntUnitRange) = counts(x, levels) .* inv(length(x)) | ||
proportions(x::IntegerArray, levels::IntUnitRange, wv::AbstractWeights) = | ||
|
@@ -101,6 +109,9 @@ proportions(x::IntegerArray, levels::IntUnitRange, wv::AbstractWeights) = | |
proportions(x, k::Integer, [wv::AbstractWeights]) | ||
|
||
Return the proportion of integers in 1 to `k` that occur in `x`. | ||
|
||
If a vector of weights `wv` is provided, the proportion of weights is computed rather | ||
than the proportion of raw counts. | ||
""" | ||
proportions(x::IntegerArray, k::Integer) = proportions(x, 1:k) | ||
proportions(x::IntegerArray, k::Integer, wv::AbstractWeights) = proportions(x, 1:k, wv) | ||
|
@@ -110,26 +121,22 @@ proportions(x::IntegerArray, wv::AbstractWeights) = proportions(x, span(x), wv) | |
#### functions for counting a single list of integers (2D) | ||
|
||
function addcounts!(r::AbstractArray, x::IntegerArray, y::IntegerArray, levels::NTuple{2,IntUnitRange}) | ||
# add counts of integers from x to r | ||
|
||
n = length(x) | ||
length(y) == n || throw(DimensionMismatch()) | ||
# add counts of pairs from zip(x,y) to r | ||
|
||
xlevels, ylevels = levels | ||
|
||
kx = length(xlevels) | ||
ky = length(ylevels) | ||
size(r) == (kx, ky) || throw(DimensionMismatch()) | ||
|
||
mx0 = xlevels[1] | ||
mx1 = xlevels[end] | ||
my0 = ylevels[1] | ||
my1 = ylevels[end] | ||
checkbounds(r, axes(xlevels, 1), axes(ylevels, 1)) | ||
|
||
mx0 = first(xlevels) | ||
mx1 = last(xlevels) | ||
my0 = first(ylevels) | ||
my1 = last(ylevels) | ||
|
||
bx = mx0 - 1 | ||
by = my0 - 1 | ||
|
||
for i = 1:n | ||
for i in eachindex(vec(x), vec(y)) | ||
xi = x[i] | ||
yi = y[i] | ||
if (mx0 <= xi <= mx1) && (my0 <= yi <= my1) | ||
|
@@ -141,28 +148,31 @@ end | |
|
||
function addcounts!(r::AbstractArray, x::IntegerArray, y::IntegerArray, | ||
levels::NTuple{2,IntUnitRange}, wv::AbstractWeights) | ||
# add counts of integers from x to r | ||
# add counts of pairs from zip(x,y) to r | ||
|
||
length(x) == length(y) == length(wv) || | ||
throw(DimensionMismatch("x, y, and wv must have the same length, but got $(length(x)), $(length(y)), and $(length(wv))")) | ||
|
||
axes(x) == axes(y) || | ||
throw(DimensionMismatch("x and y must have the same axes, but got $(axes(x)) and $(axes(y))")) | ||
|
||
n = length(x) | ||
length(y) == length(wv) == n || throw(DimensionMismatch()) | ||
xv, yv = vec(x), vec(y) # discard shape because weights() discards shape | ||
|
||
xlevels, ylevels = levels | ||
|
||
kx = length(xlevels) | ||
ky = length(ylevels) | ||
size(r) == (kx, ky) || throw(DimensionMismatch()) | ||
checkbounds(r, axes(xlevels, 1), axes(ylevels, 1)) | ||
|
||
mx0 = xlevels[1] | ||
mx1 = xlevels[end] | ||
my0 = ylevels[1] | ||
my1 = ylevels[end] | ||
mx0 = first(xlevels) | ||
mx1 = last(xlevels) | ||
my0 = first(ylevels) | ||
my1 = last(ylevels) | ||
|
||
bx = mx0 - 1 | ||
by = my0 - 1 | ||
|
||
for i = 1:n | ||
xi = x[i] | ||
yi = y[i] | ||
for i in eachindex(xv, yv, wv) | ||
xi = xv[i] | ||
yi = yv[i] | ||
if (mx0 <= xi <= mx1) && (my0 <= yi <= my1) | ||
r[xi - bx, yi - by] += wv[i] | ||
end | ||
|
@@ -235,13 +245,15 @@ end | |
|
||
|
||
""" | ||
addcounts!(dict, x[, wv]; alg = :auto) | ||
addcounts!(dict, x; alg = :auto) | ||
addcounts!(dict, x, wv) | ||
|
||
Add counts based on `x` to a count map. New entries will be added if new values come up. | ||
|
||
If a weighting vector `wv` is specified, the sum of the weights is used rather than the | ||
raw counts. | ||
|
||
`alg` can be one of: | ||
`alg` is only allowed for unweighted counting and can be one of: | ||
- `:auto` (default): if `StatsBase.radixsort_safe(eltype(x)) == true` then use | ||
`:radixsort`, otherwise use `:dict`. | ||
|
||
|
@@ -284,9 +296,9 @@ function addcounts_dict!(cm::Dict{T}, x) where T | |
end | ||
|
||
# If the bits type is of small size i.e. it can have up to 65536 distinct values | ||
# then it is always better to apply a counting-sort like reduce algorithm for | ||
# then it is always better to apply a counting-sort like reduce algorithm for | ||
# faster results and less memory usage. However we still wish to enable others | ||
# to write generic algorithms, therefore the methods below still accept the | ||
# to write generic algorithms, therefore the methods below still accept the | ||
# `alg` argument but it is ignored. | ||
function _addcounts!(::Type{Bool}, cm::Dict{Bool}, x::AbstractArray{Bool}; alg = :ignored) | ||
sumx = sum(x) | ||
|
@@ -335,32 +347,42 @@ const BaseRadixSortSafeTypes = Union{Int8, Int16, Int32, Int64, Int128, | |
"Can the type be safely sorted by radixsort" | ||
radixsort_safe(::Type{T}) where T = T<:BaseRadixSortSafeTypes | ||
|
||
function _addcounts_radix_sort_loop!(cm::Dict{T}, sx::AbstractArray{T}) where T | ||
function _addcounts_radix_sort_loop!(cm::Dict{T}, sx::AbstractVector{T}) where T | ||
LilithHafner marked this conversation as resolved.
Show resolved
Hide resolved
|
||
isempty(sx) && return cm | ||
last_sx = sx[1] | ||
tmpcount = get(cm, last_sx, 0) + 1 | ||
last_sx = first(sx) | ||
start_i = firstindex(sx) | ||
|
||
# now the data is sorted: can just run through and accumulate values before | ||
# adding into the Dict | ||
@inbounds for i in 2:length(sx) | ||
@inbounds for i in start_i+1:lastindex(sx) | ||
LilithHafner marked this conversation as resolved.
Show resolved
Hide resolved
|
||
sxi = sx[i] | ||
if last_sx == sxi | ||
tmpcount += 1 | ||
else | ||
cm[last_sx] = tmpcount | ||
if last_sx != sxi | ||
cm[last_sx] = get(cm, last_sx, 0) + i - start_i | ||
last_sx = sxi | ||
tmpcount = get(cm, last_sx, 0) + 1 | ||
start_i = i | ||
end | ||
end | ||
|
||
cm[sx[end]] = tmpcount | ||
last_sx = last(sx) | ||
cm[last_sx] = get(cm, last_sx, 0) + lastindex(sx) + 1 - start_i | ||
|
||
return cm | ||
end | ||
|
||
function _alg(x::AbstractArray) | ||
@static if VERSION >= v"1.9.0-DEV" | ||
Base.DEFAULT_UNSTABLE | ||
else | ||
firstindex(x) == 1 || | ||
throw(ArgumentError("alg=:radixsort requires either one based indexing or Julia >= 1.9. " | ||
LilithHafner marked this conversation as resolved.
Show resolved
Hide resolved
|
||
* "Use `alg = :dict` as an alternative.")) | ||
SortingAlgorithms.RadixSort | ||
end | ||
end | ||
|
||
function addcounts_radixsort!(cm::Dict{T}, x::AbstractArray{T}) where T | ||
# sort the x using radixsort | ||
sx = sort(x, alg = RadixSort) | ||
sx = sort(vec(x), alg=_alg(x)) | ||
|
||
# Delegate the loop to a separate function since sort might not | ||
# be inferred in Julia 0.6 after SortingAlgorithms is loaded. | ||
|
@@ -369,18 +391,24 @@ function addcounts_radixsort!(cm::Dict{T}, x::AbstractArray{T}) where T | |
end | ||
|
||
# fall-back for `x` an iterator | ||
function addcounts_radixsort!(cm::Dict{T}, x) where T | ||
sx = sort!(collect(x), alg = RadixSort) | ||
function addcounts_radixsort!(cm::Dict{T}, x) where T | ||
cx = vec(collect(x)) | ||
sx = sort!(cx, alg = _alg(cx)) | ||
return _addcounts_radix_sort_loop!(cm, sx) | ||
end | ||
|
||
function addcounts!(cm::Dict{T}, x::AbstractArray{T}, wv::AbstractVector{W}) where {T,W<:Real} | ||
n = length(x) | ||
length(wv) == n || throw(DimensionMismatch()) | ||
# add wv weighted counts of integers from x to cm | ||
|
||
LilithHafner marked this conversation as resolved.
Show resolved
Hide resolved
|
||
length(x) == length(wv) || | ||
throw(DimensionMismatch("x and wv must have the same length, got $(length(x)) and $(length(wv))")) | ||
|
||
xv = vec(x) # discard shape because weights() discards shape | ||
nalimilan marked this conversation as resolved.
Show resolved
Hide resolved
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thinking about this again, maybe we should deprecate this behavior by adding a separate method like this: @deprecate addcounts!(cm::Dict{T}, x::AbstractArray{T}, wv::AbstractVector{W}) where {T,W<:Real} = addcounts!(cm, vec(x), wv) And restrict the signature of this method to |
||
|
||
z = zero(W) | ||
|
||
for i = 1 : n | ||
@inbounds xi = x[i] | ||
for i in eachindex(xv, wv) | ||
@inbounds xi = xv[i] | ||
@inbounds wi = wv[i] | ||
cm[xi] = get(cm, xi, z) + wi | ||
end | ||
|
@@ -390,11 +418,14 @@ end | |
|
||
""" | ||
countmap(x; alg = :auto) | ||
countmap(x::AbstractVector, w::AbstractVector{<:Real}; alg = :auto) | ||
countmap(x::AbstractVector, wv::AbstractVector{<:Real}) | ||
|
||
Return a dictionary mapping each unique value in `x` to its number | ||
of occurrences. A vector of weights `w` can be provided when `x` is a vector. | ||
Return a dictionary mapping each unique value in `x` to its number of occurrences. | ||
|
||
If a weighting vector `wv` is specified, the sum of weights is used rather than the | ||
raw counts. | ||
|
||
`alg` is only allowed for unweighted counting and can be one of: | ||
- `:auto` (default): if `StatsBase.radixsort_safe(eltype(x)) == true` then use | ||
`:radixsort`, otherwise use `:dict`. | ||
|
||
|
@@ -413,10 +444,27 @@ countmap(x::AbstractArray{T}, wv::AbstractVector{W}) where {T,W<:Real} = addcoun | |
|
||
|
||
""" | ||
proportionmap(x) | ||
proportionmap(x; alg = :auto) | ||
proportionmap(x::AbstractVector, w::AbstractVector{<:Real}) | ||
|
||
Return a dictionary mapping each unique value in `x` to its proportion in `x`. | ||
|
||
If a vector of weights `wv` is provided, the proportion of weights is computed rather | ||
than the proportion of raw counts. | ||
|
||
`alg` is only allowed for unweighted counting and can be one of: | ||
- `:auto` (default): if `StatsBase.radixsort_safe(eltype(x)) == true` then use | ||
`:radixsort`, otherwise use `:dict`. | ||
|
||
Return a dictionary mapping each unique value in `x` to its | ||
proportion in `x`. | ||
- `:radixsort`: if `radixsort_safe(eltype(x)) == true` then use the | ||
[radix sort](https://en.wikipedia.org/wiki/Radix_sort) | ||
algorithm to sort the input vector which will generally lead to | ||
shorter running time. However the radix sort algorithm creates a | ||
copy of the input vector and hence uses more RAM. Choose `:dict` | ||
if the amount of available RAM is a limitation. | ||
|
||
- `:dict`: use `Dict`-based method which is generally slower but uses less | ||
RAM and is safe for any data type. | ||
""" | ||
proportionmap(x::AbstractArray) = _normalize_countmap(countmap(x), length(x)) | ||
proportionmap(x::AbstractArray, wv::AbstractWeights) = _normalize_countmap(countmap(x, wv), sum(wv)) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21,6 +21,7 @@ length(wv::AbstractWeights) = length(wv.values) | |
sum(wv::AbstractWeights) = wv.sum | ||
isempty(wv::AbstractWeights) = isempty(wv.values) | ||
size(wv::AbstractWeights) = size(wv.values) | ||
Base.axes(wv::AbstractWeights) = Base.axes(wv.values) | ||
LilithHafner marked this conversation as resolved.
Show resolved
Hide resolved
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this definition correct? We only support There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The As for For example using StatsBase
w = Weights([4,2,6])
w[0x2] = 1
display(w)
#=
3-element Weights{Int64, Int64, Vector{Int64}}:
4
1
6
=# The previous implicit definition ( using StatsBase, OffsetArrays
w = Weights(OffsetVector([4,2,6], 2))
display(w)
#=
3-element Weights{Int64, Int64, OffsetVector{Int64, Vector{Int64}}}:
3 # unsafe read
4972085568 # unsafe read
4
=# On this PR using StatsBase, OffsetArrays
w = Weights(OffsetVector([4,2,6], 2))
display(w)
#=
3-element Weights{Int64, Int64, OffsetVector{Int64, Vector{Int64}}} with indices 3:5:
4
2
6
=# There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Given the discussion in #791 I would propose to add tests of the consequences of this change (e.g. making sure that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This change does not add any new functionality, it simply removes segfaults from existing functionality. @bkamins, would you be willing to create a PR adding those tests? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if StatsBase.jl maintainers are OK to merge this PR without these tests then I can add them later. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess it's better to avoid known segfaults (or worse, incorrect behavior) even if not tested completely, so in the interest of avoiding this PR from being forgotten again we could merge without additional tests. |
||
|
||
Base.convert(::Type{Vector}, wv::AbstractWeights) = convert(Vector, wv.values) | ||
|
||
|
@@ -299,6 +300,7 @@ sum(wv::UnitWeights{T}) where T = convert(T, length(wv)) | |
isempty(wv::UnitWeights) = iszero(wv.len) | ||
length(wv::UnitWeights) = wv.len | ||
size(wv::UnitWeights) = tuple(length(wv)) | ||
Base.axes(wv::UnitWeights) = tuple(Base.OneTo(length(wv))) | ||
|
||
Base.convert(::Type{Vector}, wv::UnitWeights{T}) where {T} = ones(T, length(wv)) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think
better expresses the intention in the original code. Also maybe this condition should be noted in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the previous version errored when
r
was larger thanlevels
and if we keep that restriction, it should be noted in the documentation.However, it feels more natural to me not to keep that restriction, especially since it is not documented. So long as what we are adding to
r
fits intor
, this method will work and other entries ofr
will not be affected. This looser restriction and guarantee not to mess with other entries ofr
is implicit and need not be mentioned in the docstring.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is broader. With your code if
r
isOffsetArray
then indices from1
tolength(levels)
must be valid in it. But then you are adding values tor
in indices starting from1
, which in general does not have to be first element ofr
. The docstring does not specify where the elements are put intor
. There can be two options:1
orr
That is why I propose to add
firstindex(r)==1
condition which resolves this issue as then both options are the same.I agree that we could allow
r
to be longer thanlength(levels)
(this is a change, but I think it is acceptable and makes sense).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good observation. There could be an ambiguity here. I see two solutions:
Base.require_one_based_indexing(r)
)I prefer option 1 because I think the choice made here is distinctly more natural than the choice to start at
firstindex(r)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Option 1 sounds fine, the way
r
is used should have been explained better anyway in the current version (one could have imagined that the value of each level was used as to index intor
rather than starting at 1). Allowingr
to be longer thanlength(levels)
sounds OK too as long as it's documented.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - it is chcecked. My point is that it should be documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kalmarek
Could you link that issue? I'm not seeing it and want to make sure the documentation I added addresses it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it wasn't a gh issue, but in my comments about
addcounts!
I missed the fact thatcheckbounds
forcesr
to accept indices from1
tolenght(levels)
, sincelevels::IntUnitRange
. This is something I'd call "too clever to maintain" in the long run, unless explicitly documented.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For completeness, here is the link: #722 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This thread is about ambiguity in behavior when passed an
r
that uses offset indexing. That ambiguity is resolved in the current documentation: "For eachxi ∈ x
, ifxi == levels[j]
, then we incrementr[j]
". This is a different issue than the one @kalmarek raised in the linked comment which is about whether this method could segfault on certain nonstandard indices.I pushed another commit that should address @kalmarek's latest critique. Even though we know
firstindex(levels) == 1
, we can still writefirstindex(levels)
instead of1
. (This should future proof this method in the event that Julia switches to 2 based indexing at any point in the future :), or, much more plausibly, ifIntUnitRange
ever allows offset indices.)