Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make counting more robust to input datatype #722

Merged
merged 45 commits into from
Jun 8, 2022
Merged
Show file tree
Hide file tree
Changes from 43 commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
3905f6e
support offset arrays and simplify _addcounts_radix_sort_loop
LilithHafner Oct 2, 2021
8eb5855
add tests and fix more occurances of unsupported offset arrays [ref](…
LilithHafner Oct 7, 2021
84ef887
Apply suggestion from code review
LilithHafner Oct 10, 2021
d82295d
Replace dimension checking with varargs `eachindex`
LilithHafner Oct 10, 2021
a4505b2
Test addcounts! on row vector
LilithHafner Oct 10, 2021
c334d4d
Support multidimensional arrays for :radixsort
LilithHafner Oct 10, 2021
94b3e6c
fix lastindex _addcounts_radix_sort_loop! indexing
LilithHafner Oct 11, 2021
87074b2
organize and extend testing; revert to flattening x when passed weigh…
LilithHafner Oct 11, 2021
aa21bc9
removed todo list
LilithHafner Oct 11, 2021
309586f
Update src/counts.jl
LilithHafner Oct 11, 2021
3b5e01d
test :radixsort multidimensional array
LilithHafner Oct 11, 2021
3384f45
docstrings: homogonize, correct, explain proportionmap, shrink onelin…
LilithHafner Oct 11, 2021
6b81b51
whitespace
LilithHafner Oct 11, 2021
18b70b5
test axes(::Weights)
LilithHafner Oct 12, 2021
7aa2d79
Merge branch 'JuliaStats:master' into fix-offset-array
LilithHafner Oct 14, 2021
93caf72
fix tests
LilithHafner Oct 14, 2021
62c5967
Apply suggestions from code review
LilithHafner Oct 14, 2021
1238a2d
put back dimension-mismatch (add messages)
LilithHafner Oct 23, 2021
c1c09d1
test formatting
LilithHafner Oct 23, 2021
bc20432
better error messages
LilithHafner Oct 23, 2021
cbfd92e
test :dict on reshape as well
LilithHafner Oct 23, 2021
60c0aad
typos (fix tests)
LilithHafner Oct 23, 2021
c72b1aa
fixed typo in tests that stopped a test from running
LilithHafner Oct 23, 2021
e4ca264
conform to nalimilan's style
LilithHafner Oct 23, 2021
00fe408
whitespace
LilithHafner Oct 23, 2021
a470531
rename x -> xv to avoid type instability
LilithHafner Oct 23, 2021
aca38a3
test multidimensional countmap input with vector weights
LilithHafner Oct 23, 2021
168b3ab
docstring remove 'when x is a vector'
LilithHafner Oct 23, 2021
204f89a
sum of weights to proportion of weights in docstrings where applicable
LilithHafner Oct 23, 2021
3f3e657
use nalimilan's word choice
LilithHafner Oct 24, 2021
657c1ec
docstring typo
LilithHafner Oct 24, 2021
2ebf1f3
nalimilan's style preference
LilithHafner Oct 24, 2021
f469f0c
vectorize before sorting in iterator case; remove @boundcheck annotat…
LilithHafner Oct 26, 2021
6cf64d9
skip OffsetArray tests before 1.6
LilithHafner Oct 26, 2021
dc1f7ae
Update src/counts.jl
LilithHafner May 17, 2022
4fd45d9
Apply suggestions from code review
LilithHafner May 17, 2022
d31dcce
Merge branch 'JuliaStats:master' into fix-offset-array
LilithHafner May 17, 2022
0fd1c27
Update src/counts.jl
LilithHafner May 17, 2022
fbdf564
Update src/counts.jl
LilithHafner May 19, 2022
e1fab99
use firstindex instead of 1 for robustness to future type signature e…
Jun 2, 2022
717c795
Switch from SortingAlgorithms to Base's radix sort in Julia 1.9+ (clo…
Jun 6, 2022
ba59cc3
Apply suggestions from code review
LilithHafner Jun 6, 2022
2dfba9f
Fix typo
LilithHafner Jun 6, 2022
8f446ee
Style
LilithHafner Jun 6, 2022
e20e28c
Minor fixes
nalimilan Jun 8, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion Project.toml
Original file line number Diff line number Diff line change
Expand Up @@ -28,8 +28,9 @@ julia = "1"
[extras]
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
DelimitedFiles = "8bb1440f-4735-579b-a4ab-409b98df4dab"
OffsetArrays = "6fe1bfb0-de20-5000-8ca7-80f57d26f881"
StableRNGs = "860ef19b-820b-49d6-a774-d7a799459cd3"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Dates", "DelimitedFiles", "StableRNGs", "Test"]
test = ["Dates", "DelimitedFiles", "OffsetArrays", "StableRNGs", "Test"]
194 changes: 121 additions & 73 deletions src/counts.jl
Original file line number Diff line number Diff line change
Expand Up @@ -16,24 +16,24 @@ end

#### functions for counting a single list of integers (1D)
"""
addcounts!(r, x, levels::UnitRange{<:Int}, [wv::AbstractWeights])
addcounts!(r, x, levels::UnitRange{<:Integer}, [wv::AbstractWeights])

Add the number of occurrences in `x` of each value in `levels` to an existing
array `r`. If a weighting vector `wv` is specified, the sum of weights is used
rather than the raw counts.
array `r`. For each `xi ∈ x`, if `xi == levels[j]`, then we increment `r[j]`.

If a weighting vector `wv` is specified, the sum of weights is used rather than the
raw counts.
"""
function addcounts!(r::AbstractArray, x::IntegerArray, levels::IntUnitRange)
# add counts of integers from x to r
# add counts of integers from x that fall within levels to r

k = length(levels)
length(r) == k || throw(DimensionMismatch())
checkbounds(r, axes(levels)...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think

Suggested change
checkbounds(r, axes(levels)...)
(firstindex(r) == 1) && length(r) == k) || throw(DimensionMismatch())

better expresses the intention in the original code. Also maybe this condition should be noted in the documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the previous version errored when r was larger than levels and if we keep that restriction, it should be noted in the documentation.

However, it feels more natural to me not to keep that restriction, especially since it is not documented. So long as what we are adding to r fits into r, this method will work and other entries of r will not be affected. This looser restriction and guarantee not to mess with other entries of r is implicit and need not be mentioned in the docstring.

Copy link
Contributor

@bkamins bkamins May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is broader. With your code if r is OffsetArray then indices from 1 to length(levels) must be valid in it. But then you are adding values to r in indices starting from 1, which in general does not have to be first element of r. The docstring does not specify where the elements are put into r. There can be two options:

  • starting from index 1 or
  • starting from beginning of r

That is why I propose to add firstindex(r)==1 condition which resolves this issue as then both options are the same.
I agree that we could allow r to be longer than length(levels) (this is a change, but I think it is acceptable and makes sense).

Copy link
Contributor Author

@LilithHafner LilithHafner May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good observation. There could be an ambiguity here. I see two solutions:

  1. document the choice we make
  2. throw an error on these cases (e.g. Base.require_one_based_indexing(r))

I prefer option 1 because I think the choice made here is distinctly more natural than the choice to start at firstindex(r)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option 1 sounds fine, the way r is used should have been explained better anyway in the current version (one could have imagined that the value of each level was used as to index into r rather than starting at 1). Allowing r to be longer than length(levels) sounds OK too as long as it's documented.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - it is chcecked. My point is that it should be documented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kalmarek

But given that I raised similar issue two days ago (in comments just below) there might be some merit in documenting it ;)

Could you link that issue? I'm not seeing it and want to make sure the documentation I added addresses it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it wasn't a gh issue, but in my comments about addcounts! I missed the fact that checkbounds forces r to accept indices from 1 to lenght(levels), since levels::IntUnitRange. This is something I'd call "too clever to maintain" in the long run, unless explicitly documented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For completeness, here is the link: #722 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This thread is about ambiguity in behavior when passed an r that uses offset indexing. That ambiguity is resolved in the current documentation: "For each xi ∈ x, if xi == levels[j], then we increment r[j]". This is a different issue than the one @kalmarek raised in the linked comment which is about whether this method could segfault on certain nonstandard indices.

I pushed another commit that should address @kalmarek's latest critique. Even though we know firstindex(levels) == 1, we can still write firstindex(levels) instead of 1. (This should future proof this method in the event that Julia switches to 2 based indexing at any point in the future :), or, much more plausibly, if IntUnitRange ever allows offset indices.)


m0 = levels[1]
m1 = levels[end]
b = m0 - 1
m0 = first(levels)
m1 = last(levels)
b = m0 - firstindex(levels) # firstindex(levels) == 1 because levels::IntUnitRange

@inbounds for i in 1 : length(x)
xi = x[i]
@inbounds for xi in x
if m0 <= xi <= m1
r[xi - b] += 1
end
Expand All @@ -42,15 +42,21 @@ function addcounts!(r::AbstractArray, x::IntegerArray, levels::IntUnitRange)
end

function addcounts!(r::AbstractArray, x::IntegerArray, levels::IntUnitRange, wv::AbstractWeights)
k = length(levels)
length(r) == k || throw(DimensionMismatch())
# add wv weighted counts of integers from x that fall within levels to r

length(x) == length(wv) ||
throw(DimensionMismatch("x and wv must have the same length, got $(length(x)) and $(length(wv))"))

m0 = levels[1]
m1 = levels[end]
xv = vec(x) # discard shape because weights() discards shape

checkbounds(r, axes(levels)...)
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved

m0 = first(levels)
m1 = last(levels)
b = m0 - 1

@inbounds for i in 1 : length(x)
xi = x[i]
@inbounds for i in eachindex(xv, wv)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will fail if x is OffsetArray:

julia> eachindex(vec(OffsetArray(1:4, 3)), Weights(1:4))
ERROR: DimensionMismatch

Copy link
Contributor Author

@LilithHafner LilithHafner May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, addcounts! of an OffsetArray and a non-offset Weights resulted in a segfault. In general, it is a nontrivial interface decision whether to silent support mismatched indices. In this particular case, because it is not a regression to throw a dimension mismatch instead of segfaulting, I'd like to call that error out of scope.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I thought you wanted to fix OffsetArrays.jl issues in this PR, but agreed it can go to a separate PR since what you propose now is safe.

However note that in:

xv = vec(x) # discard shape because weights() discards shape

you already decide on the interface you mention - choosing to support "position in sequence" interpretation (if x is higher dimensional you implicitly perform this matching)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That choice was forced to avoid a regression on addcounts(zeros(Int, 5), rand(1:5, 3, 3), Weights(rand(9))).

It is also an implicit conversion to linear indexing, which is somewhat different than an implicit alignment of mismatched indices.

OK - I thought you wanted to fix OffsetArrays.jl issues in this PR

I don't see this as an issue:

julia> w = Weights([4,2,6]);

julia> ow = Weights(OffsetVector([4,2,6], 2));

julia> x = [3,1,2];

julia> ox = OffsetVector([3,1,2], 2);

julia> addcounts!(zeros(Int, 3), x, 1:3, w)
3-element Vector{Int64}:
 2
 6
 4

julia> addcounts!(zeros(Int, 3), ox, 1:3, ow)
3-element Vector{Int64}:
 2
 6
 4

julia> addcounts!(zeros(Int, 3), ox, 1:3, w)
ERROR: DimensionMismatch: all inputs to eachindex must have the same axes, got (OffsetArrays.IdOffsetRange(values=3:5, indices=3:5),) and (Base.OneTo(3),)
Stacktrace:
 [1] throw_eachindex_mismatch_indices(::IndexCartesian, ::Tuple{OffsetArrays.IdOffsetRange{Int64, Base.OneTo{Int64}}}, ::Vararg{Any})
   @ Base ./abstractarray.jl:292
 [2] eachindex
   @ ./multidimensional.jl:388 [inlined]
 [3] eachindex
   @ ./abstractarray.jl:334 [inlined]
 [4] addcounts!(r::Vector{Int64}, x::OffsetVector{Int64, Vector{Int64}}, levels::UnitRange{Int64}, wv::Weights{Int64, Int64, Vector{Int64}})
   @ StatsBase ~/.julia/dev/StatsBase/src/counts.jl:58
 [5] top-level scope
   @ REPL[24]:1

julia> addcounts!(zeros(Int, 3), x, 1:3, ow)
ERROR: DimensionMismatch: all inputs to eachindex must have the same axes, got (Base.OneTo(3),) and (OffsetArrays.IdOffsetRange(values=3:5, indices=3:5),)
Stacktrace:
 [1] throw_eachindex_mismatch_indices(::IndexCartesian, ::Tuple{Base.OneTo{Int64}}, ::Vararg{Any})
   @ Base ./abstractarray.jl:292
 [2] eachindex
   @ ./multidimensional.jl:388 [inlined]
 [3] eachindex
   @ ./abstractarray.jl:334 [inlined]
 [4] addcounts!(r::Vector{Int64}, x::Vector{Int64}, levels::UnitRange{Int64}, wv::Weights{Int64, Int64, OffsetVector{Int64, Vector{Int64}}})
   @ StatsBase ~/.julia/dev/StatsBase/src/counts.jl:58
 [5] top-level scope
   @ REPL[25]:1

julia> 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To allow different index styles (but same length) we could do:

for (xi, wi) in zip(xv, wv)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zip is too tolerant, it even allows inputs of different sizes and silently ignores extra elements in the other argument. As mentioned in the main discussion, we should probably be strict about having matching indices or things could be really confusing.

xi = xv[i]
if m0 <= xi <= m1
r[xi - b] += wv[i]
end
Expand All @@ -69,8 +75,8 @@ falling in that range will be considered (the others will be ignored without
raising an error or a warning). If an integer `k` is provided, only values in the
range `1:k` will be considered.

If a weighting vector `wv` is specified, the sum of the weights is used rather than the
raw counts.
If a vector of weights `wv` is provided, the proportion of weights is computed rather
than the proportion of raw counts.

The output is a vector of length `length(levels)`.
"""
Expand All @@ -90,8 +96,10 @@ counts(x::IntegerArray, wv::AbstractWeights) = counts(x, span(x), wv)
proportions(x, levels=span(x), [wv::AbstractWeights])

Return the proportion of values in the range `levels` that occur in `x`.
Equivalent to `counts(x, levels) / length(x)`. If a weighting vector `wv`
is specified, the sum of the weights is used rather than the raw counts.
Equivalent to `counts(x, levels) / length(x)`.

If a vector of weights `wv` is provided, the proportion of weights is computed rather
than the proportion of raw counts.
"""
proportions(x::IntegerArray, levels::IntUnitRange) = counts(x, levels) .* inv(length(x))
proportions(x::IntegerArray, levels::IntUnitRange, wv::AbstractWeights) =
Expand All @@ -101,6 +109,9 @@ proportions(x::IntegerArray, levels::IntUnitRange, wv::AbstractWeights) =
proportions(x, k::Integer, [wv::AbstractWeights])

Return the proportion of integers in 1 to `k` that occur in `x`.

If a vector of weights `wv` is provided, the proportion of weights is computed rather
than the proportion of raw counts.
"""
proportions(x::IntegerArray, k::Integer) = proportions(x, 1:k)
proportions(x::IntegerArray, k::Integer, wv::AbstractWeights) = proportions(x, 1:k, wv)
Expand All @@ -110,26 +121,22 @@ proportions(x::IntegerArray, wv::AbstractWeights) = proportions(x, span(x), wv)
#### functions for counting a single list of integers (2D)

function addcounts!(r::AbstractArray, x::IntegerArray, y::IntegerArray, levels::NTuple{2,IntUnitRange})
# add counts of integers from x to r

n = length(x)
length(y) == n || throw(DimensionMismatch())
# add counts of pairs from zip(x,y) to r

xlevels, ylevels = levels

kx = length(xlevels)
ky = length(ylevels)
size(r) == (kx, ky) || throw(DimensionMismatch())

mx0 = xlevels[1]
mx1 = xlevels[end]
my0 = ylevels[1]
my1 = ylevels[end]
checkbounds(r, axes(xlevels, 1), axes(ylevels, 1))

mx0 = first(xlevels)
mx1 = last(xlevels)
my0 = first(ylevels)
my1 = last(ylevels)

bx = mx0 - 1
by = my0 - 1

for i = 1:n
for i in eachindex(vec(x), vec(y))
xi = x[i]
yi = y[i]
if (mx0 <= xi <= mx1) && (my0 <= yi <= my1)
Expand All @@ -141,28 +148,31 @@ end

function addcounts!(r::AbstractArray, x::IntegerArray, y::IntegerArray,
levels::NTuple{2,IntUnitRange}, wv::AbstractWeights)
# add counts of integers from x to r
# add counts of pairs from zip(x,y) to r

length(x) == length(y) == length(wv) ||
throw(DimensionMismatch("x, y, and wv must have the same length, but got $(length(x)), $(length(y)), and $(length(wv))"))

axes(x) == axes(y) ||
throw(DimensionMismatch("x and y must have the same axes, but got $(axes(x)) and $(axes(y))"))

n = length(x)
length(y) == length(wv) == n || throw(DimensionMismatch())
xv, yv = vec(x), vec(y) # discard shape because weights() discards shape

xlevels, ylevels = levels

kx = length(xlevels)
ky = length(ylevels)
size(r) == (kx, ky) || throw(DimensionMismatch())
checkbounds(r, axes(xlevels, 1), axes(ylevels, 1))

mx0 = xlevels[1]
mx1 = xlevels[end]
my0 = ylevels[1]
my1 = ylevels[end]
mx0 = first(xlevels)
mx1 = last(xlevels)
my0 = first(ylevels)
my1 = last(ylevels)

bx = mx0 - 1
by = my0 - 1

for i = 1:n
xi = x[i]
yi = y[i]
for i in eachindex(xv, yv, wv)
xi = xv[i]
yi = yv[i]
if (mx0 <= xi <= mx1) && (my0 <= yi <= my1)
r[xi - bx, yi - by] += wv[i]
end
Expand Down Expand Up @@ -235,13 +245,15 @@ end


"""
addcounts!(dict, x[, wv]; alg = :auto)
addcounts!(dict, x; alg = :auto)
addcounts!(dict, x, wv)

Add counts based on `x` to a count map. New entries will be added if new values come up.

If a weighting vector `wv` is specified, the sum of the weights is used rather than the
raw counts.

`alg` can be one of:
`alg` is only allowed for unweighted counting and can be one of:
- `:auto` (default): if `StatsBase.radixsort_safe(eltype(x)) == true` then use
`:radixsort`, otherwise use `:dict`.

Expand Down Expand Up @@ -284,9 +296,9 @@ function addcounts_dict!(cm::Dict{T}, x) where T
end

# If the bits type is of small size i.e. it can have up to 65536 distinct values
# then it is always better to apply a counting-sort like reduce algorithm for
# then it is always better to apply a counting-sort like reduce algorithm for
# faster results and less memory usage. However we still wish to enable others
# to write generic algorithms, therefore the methods below still accept the
# to write generic algorithms, therefore the methods below still accept the
# `alg` argument but it is ignored.
function _addcounts!(::Type{Bool}, cm::Dict{Bool}, x::AbstractArray{Bool}; alg = :ignored)
sumx = sum(x)
Expand Down Expand Up @@ -335,32 +347,42 @@ const BaseRadixSortSafeTypes = Union{Int8, Int16, Int32, Int64, Int128,
"Can the type be safely sorted by radixsort"
radixsort_safe(::Type{T}) where T = T<:BaseRadixSortSafeTypes

function _addcounts_radix_sort_loop!(cm::Dict{T}, sx::AbstractArray{T}) where T
function _addcounts_radix_sort_loop!(cm::Dict{T}, sx::AbstractVector{T}) where T
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
isempty(sx) && return cm
last_sx = sx[1]
tmpcount = get(cm, last_sx, 0) + 1
last_sx = first(sx)
start_i = firstindex(sx)

# now the data is sorted: can just run through and accumulate values before
# adding into the Dict
@inbounds for i in 2:length(sx)
@inbounds for i in start_i+1:lastindex(sx)
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
sxi = sx[i]
if last_sx == sxi
tmpcount += 1
else
cm[last_sx] = tmpcount
if last_sx != sxi
cm[last_sx] = get(cm, last_sx, 0) + i - start_i
last_sx = sxi
tmpcount = get(cm, last_sx, 0) + 1
start_i = i
end
end

cm[sx[end]] = tmpcount
last_sx = last(sx)
cm[last_sx] = get(cm, last_sx, 0) + lastindex(sx) + 1 - start_i

return cm
end

function _alg(x::AbstractArray)
@static if VERSION >= v"1.9.0-DEV"
Base.DEFAULT_UNSTABLE
else
firstindex(x) == 1 ||
throw(ArgumentError("alg=:radixsort requires either one based indexing or Julia >= 1.9. "
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
* "Use `alg = :dict` as an alternative."))
SortingAlgorithms.RadixSort
end
end

function addcounts_radixsort!(cm::Dict{T}, x::AbstractArray{T}) where T
# sort the x using radixsort
sx = sort(x, alg = RadixSort)
sx = sort(vec(x), alg=_alg(x))

# Delegate the loop to a separate function since sort might not
# be inferred in Julia 0.6 after SortingAlgorithms is loaded.
Expand All @@ -369,18 +391,24 @@ function addcounts_radixsort!(cm::Dict{T}, x::AbstractArray{T}) where T
end

# fall-back for `x` an iterator
function addcounts_radixsort!(cm::Dict{T}, x) where T
sx = sort!(collect(x), alg = RadixSort)
function addcounts_radixsort!(cm::Dict{T}, x) where T
cx = vec(collect(x))
sx = sort!(cx, alg = _alg(cx))
return _addcounts_radix_sort_loop!(cm, sx)
end

function addcounts!(cm::Dict{T}, x::AbstractArray{T}, wv::AbstractVector{W}) where {T,W<:Real}
n = length(x)
length(wv) == n || throw(DimensionMismatch())
# add wv weighted counts of integers from x to cm

LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
length(x) == length(wv) ||
throw(DimensionMismatch("x and wv must have the same length, got $(length(x)) and $(length(wv))"))

xv = vec(x) # discard shape because weights() discards shape
nalimilan marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this again, maybe we should deprecate this behavior by adding a separate method like this:

@deprecate addcounts!(cm::Dict{T}, x::AbstractArray{T}, wv::AbstractVector{W}) where {T,W<:Real} = addcounts!(cm, vec(x), wv)

And restrict the signature of this method to x::AbstractVector{T}.


z = zero(W)

for i = 1 : n
@inbounds xi = x[i]
for i in eachindex(xv, wv)
@inbounds xi = xv[i]
@inbounds wi = wv[i]
cm[xi] = get(cm, xi, z) + wi
end
Expand All @@ -390,11 +418,14 @@ end

"""
countmap(x; alg = :auto)
countmap(x::AbstractVector, w::AbstractVector{<:Real}; alg = :auto)
countmap(x::AbstractVector, wv::AbstractVector{<:Real})

Return a dictionary mapping each unique value in `x` to its number
of occurrences. A vector of weights `w` can be provided when `x` is a vector.
Return a dictionary mapping each unique value in `x` to its number of occurrences.

If a weighting vector `wv` is specified, the sum of weights is used rather than the
raw counts.

`alg` is only allowed for unweighted counting and can be one of:
- `:auto` (default): if `StatsBase.radixsort_safe(eltype(x)) == true` then use
`:radixsort`, otherwise use `:dict`.

Expand All @@ -413,10 +444,27 @@ countmap(x::AbstractArray{T}, wv::AbstractVector{W}) where {T,W<:Real} = addcoun


"""
proportionmap(x)
proportionmap(x; alg = :auto)
proportionmap(x::AbstractVector, w::AbstractVector{<:Real})

Return a dictionary mapping each unique value in `x` to its proportion in `x`.

If a vector of weights `wv` is provided, the proportion of weights is computed rather
than the proportion of raw counts.

`alg` is only allowed for unweighted counting and can be one of:
- `:auto` (default): if `StatsBase.radixsort_safe(eltype(x)) == true` then use
`:radixsort`, otherwise use `:dict`.

Return a dictionary mapping each unique value in `x` to its
proportion in `x`.
- `:radixsort`: if `radixsort_safe(eltype(x)) == true` then use the
[radix sort](https://en.wikipedia.org/wiki/Radix_sort)
algorithm to sort the input vector which will generally lead to
shorter running time. However the radix sort algorithm creates a
copy of the input vector and hence uses more RAM. Choose `:dict`
if the amount of available RAM is a limitation.

- `:dict`: use `Dict`-based method which is generally slower but uses less
RAM and is safe for any data type.
"""
proportionmap(x::AbstractArray) = _normalize_countmap(countmap(x), length(x))
proportionmap(x::AbstractArray, wv::AbstractWeights) = _normalize_countmap(countmap(x, wv), sum(wv))
2 changes: 2 additions & 0 deletions src/weights.jl
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ length(wv::AbstractWeights) = length(wv.values)
sum(wv::AbstractWeights) = wv.sum
isempty(wv::AbstractWeights) = isempty(wv.values)
size(wv::AbstractWeights) = size(wv.values)
Base.axes(wv::AbstractWeights) = Base.axes(wv.values)
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this definition correct? We only support Integer indices in getindex and Int in setindex! below it seems, so non-standard axes that don't operate with Integers will yield unsupported indices it seems.

Copy link
Contributor Author

@LilithHafner LilithHafner May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AbstractArray interface requires that axes returns a tuple of AbstractUnitRange{<:Integer}s, so we can assume that their eltype will be an Integer. From the axes docstring: "Each of the indices has to be an AbstractUnitRange{<:Integer}".

As for Integers other than Int, As long as we implement getindex and setindex for Int as required by the AbstractArray interface, Julia's Base libraries will convert other Integers to Int as needed.

For example

using StatsBase
w = Weights([4,2,6])
w[0x2] = 1
display(w)
#=
3-element Weights{Int64, Int64, Vector{Int64}}:
 4
 1
 6
=#

The previous implicit definition (axes(wv::AbstractWeights) = (Base.OneTo(length(wv)),)) was not correct

using StatsBase, OffsetArrays
w = Weights(OffsetVector([4,2,6], 2))
display(w)
#=
3-element Weights{Int64, Int64, OffsetVector{Int64, Vector{Int64}}}:
          3 # unsafe read
 4972085568 # unsafe read
          4
=#

On this PR

using StatsBase, OffsetArrays
w = Weights(OffsetVector([4,2,6], 2))
display(w)
#=
3-element Weights{Int64, Int64, OffsetVector{Int64, Vector{Int64}}} with indices 3:5:
 4
 2
 6
=#

Copy link
Contributor

@bkamins bkamins May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the discussion in #791 I would propose to add tests of the consequences of this change (e.g. making sure that getindex, setindex!, firstindex, lastindex, isequal etc. work correctly for non-1-based case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change does not add any new functionality, it simply removes segfaults from existing functionality. @bkamins, would you be willing to create a PR adding those tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if StatsBase.jl maintainers are OK to merge this PR without these tests then I can add them later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's better to avoid known segfaults (or worse, incorrect behavior) even if not tested completely, so in the interest of avoiding this PR from being forgotten again we could merge without additional tests.


Base.convert(::Type{Vector}, wv::AbstractWeights) = convert(Vector, wv.values)

Expand Down Expand Up @@ -299,6 +300,7 @@ sum(wv::UnitWeights{T}) where T = convert(T, length(wv))
isempty(wv::UnitWeights) = iszero(wv.len)
length(wv::UnitWeights) = wv.len
size(wv::UnitWeights) = tuple(length(wv))
Base.axes(wv::UnitWeights) = tuple(Base.OneTo(length(wv)))

Base.convert(::Type{Vector}, wv::UnitWeights{T}) where {T} = ones(T, length(wv))

Expand Down
Loading