Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make counting more robust to input datatype #722

Merged
merged 45 commits into from
Jun 8, 2022
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
3905f6e
support offset arrays and simplify _addcounts_radix_sort_loop
LilithHafner Oct 2, 2021
8eb5855
add tests and fix more occurances of unsupported offset arrays [ref](…
LilithHafner Oct 7, 2021
84ef887
Apply suggestion from code review
LilithHafner Oct 10, 2021
d82295d
Replace dimension checking with varargs `eachindex`
LilithHafner Oct 10, 2021
a4505b2
Test addcounts! on row vector
LilithHafner Oct 10, 2021
c334d4d
Support multidimensional arrays for :radixsort
LilithHafner Oct 10, 2021
94b3e6c
fix lastindex _addcounts_radix_sort_loop! indexing
LilithHafner Oct 11, 2021
87074b2
organize and extend testing; revert to flattening x when passed weigh…
LilithHafner Oct 11, 2021
aa21bc9
removed todo list
LilithHafner Oct 11, 2021
309586f
Update src/counts.jl
LilithHafner Oct 11, 2021
3b5e01d
test :radixsort multidimensional array
LilithHafner Oct 11, 2021
3384f45
docstrings: homogonize, correct, explain proportionmap, shrink onelin…
LilithHafner Oct 11, 2021
6b81b51
whitespace
LilithHafner Oct 11, 2021
18b70b5
test axes(::Weights)
LilithHafner Oct 12, 2021
7aa2d79
Merge branch 'JuliaStats:master' into fix-offset-array
LilithHafner Oct 14, 2021
93caf72
fix tests
LilithHafner Oct 14, 2021
62c5967
Apply suggestions from code review
LilithHafner Oct 14, 2021
1238a2d
put back dimension-mismatch (add messages)
LilithHafner Oct 23, 2021
c1c09d1
test formatting
LilithHafner Oct 23, 2021
bc20432
better error messages
LilithHafner Oct 23, 2021
cbfd92e
test :dict on reshape as well
LilithHafner Oct 23, 2021
60c0aad
typos (fix tests)
LilithHafner Oct 23, 2021
c72b1aa
fixed typo in tests that stopped a test from running
LilithHafner Oct 23, 2021
e4ca264
conform to nalimilan's style
LilithHafner Oct 23, 2021
00fe408
whitespace
LilithHafner Oct 23, 2021
a470531
rename x -> xv to avoid type instability
LilithHafner Oct 23, 2021
aca38a3
test multidimensional countmap input with vector weights
LilithHafner Oct 23, 2021
168b3ab
docstring remove 'when x is a vector'
LilithHafner Oct 23, 2021
204f89a
sum of weights to proportion of weights in docstrings where applicable
LilithHafner Oct 23, 2021
3f3e657
use nalimilan's word choice
LilithHafner Oct 24, 2021
657c1ec
docstring typo
LilithHafner Oct 24, 2021
2ebf1f3
nalimilan's style preference
LilithHafner Oct 24, 2021
f469f0c
vectorize before sorting in iterator case; remove @boundcheck annotat…
LilithHafner Oct 26, 2021
6cf64d9
skip OffsetArray tests before 1.6
LilithHafner Oct 26, 2021
dc1f7ae
Update src/counts.jl
LilithHafner May 17, 2022
4fd45d9
Apply suggestions from code review
LilithHafner May 17, 2022
d31dcce
Merge branch 'JuliaStats:master' into fix-offset-array
LilithHafner May 17, 2022
0fd1c27
Update src/counts.jl
LilithHafner May 17, 2022
fbdf564
Update src/counts.jl
LilithHafner May 19, 2022
e1fab99
use firstindex instead of 1 for robustness to future type signature e…
Jun 2, 2022
717c795
Switch from SortingAlgorithms to Base's radix sort in Julia 1.9+ (clo…
Jun 6, 2022
ba59cc3
Apply suggestions from code review
LilithHafner Jun 6, 2022
2dfba9f
Fix typo
LilithHafner Jun 6, 2022
8f446ee
Style
LilithHafner Jun 6, 2022
e20e28c
Minor fixes
nalimilan Jun 8, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
86 changes: 38 additions & 48 deletions src/counts.jl
Original file line number Diff line number Diff line change
Expand Up @@ -23,17 +23,15 @@ array `r`. If a weighting vector `wv` is specified, the sum of weights is used
rather than the raw counts.
"""
function addcounts!(r::AbstractArray, x::IntegerArray, levels::IntUnitRange)
# add counts of integers from x to r
# add counts of integers from x that fall within levels to r

k = length(levels)
length(r) == k || throw(DimensionMismatch())
@boundscheck checkbounds(r, axes(levels)...)
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved

m0 = levels[1]
m1 = levels[end]
m0 = first(levels)
m1 = last(levels)
b = m0 - 1

@inbounds for i in 1 : length(x)
xi = x[i]
@inbounds for xi in x
if m0 <= xi <= m1
r[xi - b] += 1
end
Expand All @@ -42,14 +40,15 @@ function addcounts!(r::AbstractArray, x::IntegerArray, levels::IntUnitRange)
end

function addcounts!(r::AbstractArray, x::IntegerArray, levels::IntUnitRange, wv::AbstractWeights)
k = length(levels)
length(r) == k || throw(DimensionMismatch())
# add wv weighted counts of integers from x that fall within levels to r

@boundscheck checkbounds(r, axes(levels)...)
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved

m0 = levels[1]
m1 = levels[end]
m0 = first(levels)
m1 = last(levels)
b = m0 - 1

@inbounds for i in 1 : length(x)
@inbounds for i in eachindex(x, wv)
xi = x[i]
if m0 <= xi <= m1
r[xi - b] += wv[i]
Expand Down Expand Up @@ -112,24 +111,21 @@ proportions(x::IntegerArray, wv::AbstractWeights) = proportions(x, span(x), wv)
function addcounts!(r::AbstractArray, x::IntegerArray, y::IntegerArray, levels::NTuple{2,IntUnitRange})
# add counts of integers from x to r

n = length(x)
length(y) == n || throw(DimensionMismatch())

xlevels, ylevels = levels

kx = length(xlevels)
ky = length(ylevels)
size(r) == (kx, ky) || throw(DimensionMismatch())
@boundscheck checkbounds(r, axes(xlevels, 1), axes(ylevels, 1))

mx0 = xlevels[1]
mx1 = xlevels[end]
my0 = ylevels[1]
my1 = ylevels[end]

mx0 = first(xlevels)
mx1 = last(xlevels)
my0 = first(ylevels)
my1 = last(ylevels)

bx = mx0 - 1
by = my0 - 1

for i = 1:n
for i in eachindex(x, y)
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
xi = x[i]
yi = y[i]
if (mx0 <= xi <= mx1) && (my0 <= yi <= my1)
Expand All @@ -143,24 +139,20 @@ function addcounts!(r::AbstractArray, x::IntegerArray, y::IntegerArray,
levels::NTuple{2,IntUnitRange}, wv::AbstractWeights)
# add counts of integers from x to r

n = length(x)
length(y) == length(wv) == n || throw(DimensionMismatch())

xlevels, ylevels = levels

kx = length(xlevels)
ky = length(ylevels)
size(r) == (kx, ky) || throw(DimensionMismatch())
@boundscheck checkbounds(r, axes(xlevels, 1), axes(ylevels, 1))

mx0 = xlevels[1]
mx1 = xlevels[end]
my0 = ylevels[1]
my1 = ylevels[end]
mx0 = first(xlevels)
mx1 = last(xlevels)
my0 = first(ylevels)
my1 = last(ylevels)

bx = mx0 - 1
by = my0 - 1

for i = 1:n
for i in eachindex(x, y, wv)
xi = x[i]
yi = y[i]
if (mx0 <= xi <= mx1) && (my0 <= yi <= my1)
Expand Down Expand Up @@ -284,9 +276,9 @@ function addcounts_dict!(cm::Dict{T}, x) where T
end

# If the bits type is of small size i.e. it can have up to 65536 distinct values
# then it is always better to apply a counting-sort like reduce algorithm for
# then it is always better to apply a counting-sort like reduce algorithm for
# faster results and less memory usage. However we still wish to enable others
# to write generic algorithms, therefore the methods below still accept the
# to write generic algorithms, therefore the methods below still accept the
# `alg` argument but it is ignored.
function _addcounts!(::Type{Bool}, cm::Dict{Bool}, x::AbstractArray{Bool}; alg = :ignored)
sumx = sum(x)
Expand Down Expand Up @@ -337,23 +329,22 @@ radixsort_safe(::Type{T}) where T = T<:BaseRadixSortSafeTypes

function _addcounts_radix_sort_loop!(cm::Dict{T}, sx::AbstractArray{T}) where T
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
isempty(sx) && return cm
last_sx = sx[1]
tmpcount = get(cm, last_sx, 0) + 1
last_sx = first(sx)
start_i = firstindex(sx)

# now the data is sorted: can just run through and accumulate values before
# adding into the Dict
@inbounds for i in 2:length(sx)
@inbounds for i in start_i+1:lastindex(sx)
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
sxi = sx[i]
if last_sx == sxi
tmpcount += 1
else
cm[last_sx] = tmpcount
if last_sx != sxi
cm[last_sx] = get(cm, last_sx, 0) + i - start_i
last_sx = sxi
tmpcount = get(cm, last_sx, 0) + 1
start_i = i
end
end

cm[sx[end]] = tmpcount
last_sx = last(sx)
cm[last_sx] = get(cm, last_sx, 0) + firstindex(sx) + 1 - start_i

return cm
end
Expand All @@ -369,17 +360,16 @@ function addcounts_radixsort!(cm::Dict{T}, x::AbstractArray{T}) where T
end

# fall-back for `x` an iterator
function addcounts_radixsort!(cm::Dict{T}, x) where T
function addcounts_radixsort!(cm::Dict{T}, x) where T
sx = sort!(collect(x), alg = RadixSort)
return _addcounts_radix_sort_loop!(cm, sx)
end

function addcounts!(cm::Dict{T}, x::AbstractArray{T}, wv::AbstractVector{W}) where {T,W<:Real}
n = length(x)
length(wv) == n || throw(DimensionMismatch())

LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
z = zero(W)

for i = 1 : n
for i in eachindex(x, wv)
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
@inbounds xi = x[i]
@inbounds wi = wv[i]
cm[xi] = get(cm, xi, z) + wi
Expand Down Expand Up @@ -418,5 +408,5 @@ countmap(x::AbstractArray{T}, wv::AbstractVector{W}) where {T,W<:Real} = addcoun
Return a dictionary mapping each unique value in `x` to its
proportion in `x`.
"""
proportionmap(x::AbstractArray) = _normalize_countmap(countmap(x), length(x))
proportionmap(x::AbstractArray; alg = :auto) = _normalize_countmap(countmap(x; alg = alg), length(x))
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
proportionmap(x::AbstractArray, wv::AbstractWeights) = _normalize_countmap(countmap(x, wv), sum(wv))
3 changes: 2 additions & 1 deletion src/weights.jl
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ length(wv::AbstractWeights) = length(wv.values)
sum(wv::AbstractWeights) = wv.sum
isempty(wv::AbstractWeights) = isempty(wv.values)
size(wv::AbstractWeights) = size(wv.values)
Base.axes(wv::AbstractWeights) = Base.axes(wv.values)
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this definition correct? We only support Integer indices in getindex and Int in setindex! below it seems, so non-standard axes that don't operate with Integers will yield unsupported indices it seems.

Copy link
Contributor Author

@LilithHafner LilithHafner May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AbstractArray interface requires that axes returns a tuple of AbstractUnitRange{<:Integer}s, so we can assume that their eltype will be an Integer. From the axes docstring: "Each of the indices has to be an AbstractUnitRange{<:Integer}".

As for Integers other than Int, As long as we implement getindex and setindex for Int as required by the AbstractArray interface, Julia's Base libraries will convert other Integers to Int as needed.

For example

using StatsBase
w = Weights([4,2,6])
w[0x2] = 1
display(w)
#=
3-element Weights{Int64, Int64, Vector{Int64}}:
 4
 1
 6
=#

The previous implicit definition (axes(wv::AbstractWeights) = (Base.OneTo(length(wv)),)) was not correct

using StatsBase, OffsetArrays
w = Weights(OffsetVector([4,2,6], 2))
display(w)
#=
3-element Weights{Int64, Int64, OffsetVector{Int64, Vector{Int64}}}:
          3 # unsafe read
 4972085568 # unsafe read
          4
=#

On this PR

using StatsBase, OffsetArrays
w = Weights(OffsetVector([4,2,6], 2))
display(w)
#=
3-element Weights{Int64, Int64, OffsetVector{Int64, Vector{Int64}}} with indices 3:5:
 4
 2
 6
=#

Copy link
Contributor

@bkamins bkamins May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the discussion in #791 I would propose to add tests of the consequences of this change (e.g. making sure that getindex, setindex!, firstindex, lastindex, isequal etc. work correctly for non-1-based case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change does not add any new functionality, it simply removes segfaults from existing functionality. @bkamins, would you be willing to create a PR adding those tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if StatsBase.jl maintainers are OK to merge this PR without these tests then I can add them later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's better to avoid known segfaults (or worse, incorrect behavior) even if not tested completely, so in the interest of avoiding this PR from being forgotten again we could merge without additional tests.


Base.convert(::Type{Vector}, wv::AbstractWeights) = convert(Vector, wv.values)

Expand Down Expand Up @@ -310,7 +311,7 @@ julia> uweights(3)
1
1
1

julia> uweights(Float64, 3)
3-element UnitWeights{Float64}:
1.0
Expand Down
24 changes: 23 additions & 1 deletion test/counts.jl
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
using StatsBase
using Test
using OffsetArrays

n = 5000

Expand Down Expand Up @@ -104,7 +105,7 @@ cm_missing = countmap(skipmissing(xx))
@test cm_missing isa Dict{Int, Int}
@test cm_missing == cm

cm_any_itr = countmap((i for i in xx))
cm_any_itr = countmap((i for i in xx))
@test cm_any_itr isa Dict{Any,Int} # no knowledge about type
@test cm_missing == cm

Expand Down Expand Up @@ -164,3 +165,24 @@ end
X = view([1,1,1,2,2], 1:5)
@test countmap(X) == countmap(copy(X))
end

@testset "offset arrays" begin
x = rand(1:5, n)
w = rand(n)
xw = weights(w)
y = OffsetArray(x, n÷2)
yw = weights(OffsetArray(w, n÷2))
z = OffsetArray(x, -2n)
zw = weights(OffsetArray(w, -2n))

# proportions calls counts which calls addcounts!
@test proportions(x) == proportions(y) == proportions(z)
@test proportions(x, xw) == proportions(y, yw) == proportions(z, zw)
@test proportionmap(x) == proportionmap(y) == proportionmap(z)
@test proportionmap(x, xw) == proportionmap(y, yw) == proportionmap(z, zw)
@test (proportionmap(x) == proportionmap(x; alg = :dict) == proportionmap(x; alg = :radixsort)
== proportionmap(y) == proportionmap(y; alg = :dict) == proportionmap(y; alg = :radixsort)
== proportionmap(z) == proportionmap(z; alg = :dict) == proportionmap(z; alg = :radixsort))
@test proportionmap(x, xw) == proportionmap(y, yw) == proportionmap(z, zw)
# countmap and proportionmap only support the :dict algorithm for weighted sums.
end