Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make counting more robust to input datatype #722

Merged
merged 45 commits into from
Jun 8, 2022
Merged
Changes from 1 commit
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
3905f6e
support offset arrays and simplify _addcounts_radix_sort_loop
LilithHafner Oct 2, 2021
8eb5855
add tests and fix more occurances of unsupported offset arrays [ref](…
LilithHafner Oct 7, 2021
84ef887
Apply suggestion from code review
LilithHafner Oct 10, 2021
d82295d
Replace dimension checking with varargs `eachindex`
LilithHafner Oct 10, 2021
a4505b2
Test addcounts! on row vector
LilithHafner Oct 10, 2021
c334d4d
Support multidimensional arrays for :radixsort
LilithHafner Oct 10, 2021
94b3e6c
fix lastindex _addcounts_radix_sort_loop! indexing
LilithHafner Oct 11, 2021
87074b2
organize and extend testing; revert to flattening x when passed weigh…
LilithHafner Oct 11, 2021
aa21bc9
removed todo list
LilithHafner Oct 11, 2021
309586f
Update src/counts.jl
LilithHafner Oct 11, 2021
3b5e01d
test :radixsort multidimensional array
LilithHafner Oct 11, 2021
3384f45
docstrings: homogonize, correct, explain proportionmap, shrink onelin…
LilithHafner Oct 11, 2021
6b81b51
whitespace
LilithHafner Oct 11, 2021
18b70b5
test axes(::Weights)
LilithHafner Oct 12, 2021
7aa2d79
Merge branch 'JuliaStats:master' into fix-offset-array
LilithHafner Oct 14, 2021
93caf72
fix tests
LilithHafner Oct 14, 2021
62c5967
Apply suggestions from code review
LilithHafner Oct 14, 2021
1238a2d
put back dimension-mismatch (add messages)
LilithHafner Oct 23, 2021
c1c09d1
test formatting
LilithHafner Oct 23, 2021
bc20432
better error messages
LilithHafner Oct 23, 2021
cbfd92e
test :dict on reshape as well
LilithHafner Oct 23, 2021
60c0aad
typos (fix tests)
LilithHafner Oct 23, 2021
c72b1aa
fixed typo in tests that stopped a test from running
LilithHafner Oct 23, 2021
e4ca264
conform to nalimilan's style
LilithHafner Oct 23, 2021
00fe408
whitespace
LilithHafner Oct 23, 2021
a470531
rename x -> xv to avoid type instability
LilithHafner Oct 23, 2021
aca38a3
test multidimensional countmap input with vector weights
LilithHafner Oct 23, 2021
168b3ab
docstring remove 'when x is a vector'
LilithHafner Oct 23, 2021
204f89a
sum of weights to proportion of weights in docstrings where applicable
LilithHafner Oct 23, 2021
3f3e657
use nalimilan's word choice
LilithHafner Oct 24, 2021
657c1ec
docstring typo
LilithHafner Oct 24, 2021
2ebf1f3
nalimilan's style preference
LilithHafner Oct 24, 2021
f469f0c
vectorize before sorting in iterator case; remove @boundcheck annotat…
LilithHafner Oct 26, 2021
6cf64d9
skip OffsetArray tests before 1.6
LilithHafner Oct 26, 2021
dc1f7ae
Update src/counts.jl
LilithHafner May 17, 2022
4fd45d9
Apply suggestions from code review
LilithHafner May 17, 2022
d31dcce
Merge branch 'JuliaStats:master' into fix-offset-array
LilithHafner May 17, 2022
0fd1c27
Update src/counts.jl
LilithHafner May 17, 2022
fbdf564
Update src/counts.jl
LilithHafner May 19, 2022
e1fab99
use firstindex instead of 1 for robustness to future type signature e…
Jun 2, 2022
717c795
Switch from SortingAlgorithms to Base's radix sort in Julia 1.9+ (clo…
Jun 6, 2022
ba59cc3
Apply suggestions from code review
LilithHafner Jun 6, 2022
2dfba9f
Fix typo
LilithHafner Jun 6, 2022
8f446ee
Style
LilithHafner Jun 6, 2022
e20e28c
Minor fixes
nalimilan Jun 8, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 4 additions & 11 deletions src/counts.jl
Original file line number Diff line number Diff line change
Expand Up @@ -43,13 +43,12 @@ function addcounts!(r::AbstractArray, x::IntegerArray, levels::IntUnitRange, wv:
# add wv weighted counts of integers from x that fall within levels to r

@boundscheck checkbounds(r, axes(levels)...)
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
@boundscheck axes(x) == axes(wv) || throw(DimensionMismatch("x and wv must have the same axes"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are worth keeping (here and elsewhere) as they mention the names of the arguments whose dimensions don't match, which eachindex cannot do. Also in this particular case you've kept @inbounds so no check is done AFAICT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still thinking about this. I think the current error will look like this:

julia> countmap([1,2], [3,4,5])
ERROR: DimensionMismatch("all inputs to eachindex must have the same indices, got Base.OneTo(2) and Base.OneTo(3)")
Stacktrace:
 [1] throw_eachindex_mismatch_indices(::IndexLinear, ::Base.OneTo{Int64}, ::Vararg{Base.OneTo{Int64}, N} where N)
   @ Base ./abstractarray.jl:260
 [2] eachindex
   @ ./abstractarray.jl:316 [inlined]
 [3] eachindex
   @ ./abstractarray.jl:305 [inlined]
 [4] addcounts!(cm::Dict{Int64, Int64}, x::Vector{Int64}, wv::Vector{Int64})
   @ StatsBase ~/.julia/dev/StatsBase/src/counts.jl:382
 [5] countmap(x::Vector{Int64}, wv::Vector{Int64})
   @ StatsBase ~/.julia/dev/StatsBase/src/counts.jl:415
 [6] top-level scope
   @ none:1

julia> @inbounds eachindex([1,2], [3,4,5])
ERROR: DimensionMismatch("all inputs to eachindex must have the same indices, got Base.OneTo(2) and Base.OneTo(3)")
Stacktrace:
 [1] throw_eachindex_mismatch_indices(::IndexLinear, ::Base.OneTo{Int64}, ::Vararg{Base.OneTo{Int64}, N} where N)
   @ Base ./abstractarray.jl:260
 [2] eachindex
   @ ./abstractarray.jl:316 [inlined]
 [3] eachindex(A::Vector{Int64}, B::Vector{Int64})
   @ Base ./abstractarray.jl:305
 [4] top-level scope
   @ none:1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone is calling countmap directly in their code with axes(x) ≠ axes(wv), then the message "x and wv must have the same axes" is more helpful. If, on the other hand, the code causing the error is more removed (e.g.

julia> using Something
help?> f
    f(a, b)
does something and calls countmap(a,b) internally.
julia> f([1,2],[3,4,5])
error...

) then "all inputs to eachindex must have the same indices, got Base.OneTo(2) and Base.OneTo(3)" might be easier to understand because it lists the indices. The question is then if we want to manually lift the error message a level to update the parameter names, to which I lean toward no.

It depends on how countmap is used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in general it's always better to throw the error as soon as possible rather than letting it be thrown by another function deeper in the call stack. Otherwise it's hard to understand what happened (and you may even think StatsBase is buggy), e.g. here eachindex isn't the problem so it shouldn't appear in the error or the stack trace. (This is also why a common pattern when implementing custom array types is to use @boundschecks checkbounds(...); @inbounds ... so that the error is thrown by the custom array method rather than by the methods it calls internally.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, it's also possible and nice for users to print the indices in the error message you throw.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because dimension checking is not in an @boundscheck This would result in redundant dimension checking when the dimensions do match (the common case).

If we think about dimension checking (and bounds checking) as algorithmic operations that have a performance impact and should be performed only once for both style and performance reasons, then this is inelegant. This is how I currently think about bounds and dimension checking.

If we think about bounds and dimension checking like type assertions, a fun, effective, free, and ultimately optional way of formally commenting on code to help readers, error messages, and occasionally performance, then throwing in an extra dimension check is a good idea. This is how I would like to think about bounds and dimension checking, but I don't know if the compiler can elide redundant checks and the syntax for additional bounds and dimension checking is a lot less clean than ::Integer or a hypothetical ::InBounds.

I've only heard of this hoisting in @vchuravy's comment JuliaLang/julia#42521 (comment), and don't know where to look for details, feasibility & potential timelines, but it sounds enticing. It would be nice to obviate this whole discussion with a type error at the top of the call hierarchy.

This being a real package that people actually use, and dimension checking being cheap in practice (I think), makes me think that your suggestion of keeping the original errors (probably extended to follow eachindex's style of showing input values) is the best way to go pragmatically, even if there is theoretically a better way on the horizon or this way is technically suboptimal for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've put back eager dimension checking and added error messages.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler is generally good at removing redundant checks, but it needs to be able to prove that they are redundant.

JuliaLang/julia#42573 adds the ability to hoist @inbounds in LLVM, and JuliaLang/julia#42692 is about making that happen more often. All this will only come down the pipe in 1.8, so manual hoisting is still the name of the game for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Fun and exciting! I can't wait till the day I can stop writing @inbounds in my code altogether (or at least as much).


m0 = first(levels)
m1 = last(levels)
b = m0 - 1

@inbounds for i in eachindex(x)
@inbounds for i in eachindex(x, wv)
xi = x[i]
if m0 <= xi <= m1
r[xi - b] += wv[i]
Expand Down Expand Up @@ -112,8 +111,6 @@ proportions(x::IntegerArray, wv::AbstractWeights) = proportions(x, span(x), wv)
function addcounts!(r::AbstractArray, x::IntegerArray, y::IntegerArray, levels::NTuple{2,IntUnitRange})
# add counts of integers from x to r

@boundscheck( axes(x) == axes(y) ||
throw(DimensionMismatch("x and y must have the same axes")))

xlevels, ylevels = levels

Expand All @@ -128,7 +125,7 @@ function addcounts!(r::AbstractArray, x::IntegerArray, y::IntegerArray, levels::
bx = mx0 - 1
by = my0 - 1

for i = eachindex(x)
for i in eachindex(x, y)
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
xi = x[i]
yi = y[i]
if (mx0 <= xi <= mx1) && (my0 <= yi <= my1)
Expand All @@ -142,8 +139,6 @@ function addcounts!(r::AbstractArray, x::IntegerArray, y::IntegerArray,
levels::NTuple{2,IntUnitRange}, wv::AbstractWeights)
# add counts of integers from x to r

@boundscheck(axes(x) == axes(y) == axes(wv) ||
throw(DimensionMismatch("x, y, and wv must have the same axes")))

xlevels, ylevels = levels

Expand All @@ -157,7 +152,7 @@ function addcounts!(r::AbstractArray, x::IntegerArray, y::IntegerArray,
bx = mx0 - 1
by = my0 - 1

for i = eachindex(x)
for i in eachindex(x, y, wv)
xi = x[i]
yi = y[i]
if (mx0 <= xi <= mx1) && (my0 <= yi <= my1)
Expand Down Expand Up @@ -371,12 +366,10 @@ function addcounts_radixsort!(cm::Dict{T}, x) where T
end

function addcounts!(cm::Dict{T}, x::AbstractArray{T}, wv::AbstractVector{W}) where {T,W<:Real}
@boundscheck(axes(x) == axes(wv) ||
throw(DimensionMismatch("x and wv must have the same axes")))

LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
z = zero(W)

for i = eachindex(x)
for i in eachindex(x, wv)
LilithHafner marked this conversation as resolved.
Show resolved Hide resolved
@inbounds xi = x[i]
@inbounds wi = wv[i]
cm[xi] = get(cm, xi, z) + wi
Expand Down