Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make counting more robust to input datatype #722

Merged
merged 45 commits into from
Jun 8, 2022

Conversation

LilithHafner
Copy link
Contributor

@LilithHafner LilithHafner commented Oct 2, 2021

Fixes #721
Fixes #796

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Could you add tests using OffsetArrays to ensure it works and we don't introduce regressions in the future? While you're at it, it would be nice to test the dict-based method too if it works.

src/counts.jl Outdated Show resolved Hide resolved
@LilithHafner
Copy link
Contributor Author

Adding tests turned into a rabbit hole because I identified a few other areas where OffsetArrays lead to unsafe memory access, but I think I've got it all now, or at least a reasonably self contained chunk.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Unfortunately I'm afraid the rabbit hole is even deeper. :-)

src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
src/weights.jl Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
test/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated
@@ -43,13 +43,12 @@ function addcounts!(r::AbstractArray, x::IntegerArray, levels::IntUnitRange, wv:
# add wv weighted counts of integers from x that fall within levels to r

@boundscheck checkbounds(r, axes(levels)...)
@boundscheck axes(x) == axes(wv) || throw(DimensionMismatch("x and wv must have the same axes"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are worth keeping (here and elsewhere) as they mention the names of the arguments whose dimensions don't match, which eachindex cannot do. Also in this particular case you've kept @inbounds so no check is done AFAICT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still thinking about this. I think the current error will look like this:

julia> countmap([1,2], [3,4,5])
ERROR: DimensionMismatch("all inputs to eachindex must have the same indices, got Base.OneTo(2) and Base.OneTo(3)")
Stacktrace:
 [1] throw_eachindex_mismatch_indices(::IndexLinear, ::Base.OneTo{Int64}, ::Vararg{Base.OneTo{Int64}, N} where N)
   @ Base ./abstractarray.jl:260
 [2] eachindex
   @ ./abstractarray.jl:316 [inlined]
 [3] eachindex
   @ ./abstractarray.jl:305 [inlined]
 [4] addcounts!(cm::Dict{Int64, Int64}, x::Vector{Int64}, wv::Vector{Int64})
   @ StatsBase ~/.julia/dev/StatsBase/src/counts.jl:382
 [5] countmap(x::Vector{Int64}, wv::Vector{Int64})
   @ StatsBase ~/.julia/dev/StatsBase/src/counts.jl:415
 [6] top-level scope
   @ none:1

julia> @inbounds eachindex([1,2], [3,4,5])
ERROR: DimensionMismatch("all inputs to eachindex must have the same indices, got Base.OneTo(2) and Base.OneTo(3)")
Stacktrace:
 [1] throw_eachindex_mismatch_indices(::IndexLinear, ::Base.OneTo{Int64}, ::Vararg{Base.OneTo{Int64}, N} where N)
   @ Base ./abstractarray.jl:260
 [2] eachindex
   @ ./abstractarray.jl:316 [inlined]
 [3] eachindex(A::Vector{Int64}, B::Vector{Int64})
   @ Base ./abstractarray.jl:305
 [4] top-level scope
   @ none:1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone is calling countmap directly in their code with axes(x) ≠ axes(wv), then the message "x and wv must have the same axes" is more helpful. If, on the other hand, the code causing the error is more removed (e.g.

julia> using Something
help?> f
    f(a, b)
does something and calls countmap(a,b) internally.
julia> f([1,2],[3,4,5])
error...

) then "all inputs to eachindex must have the same indices, got Base.OneTo(2) and Base.OneTo(3)" might be easier to understand because it lists the indices. The question is then if we want to manually lift the error message a level to update the parameter names, to which I lean toward no.

It depends on how countmap is used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in general it's always better to throw the error as soon as possible rather than letting it be thrown by another function deeper in the call stack. Otherwise it's hard to understand what happened (and you may even think StatsBase is buggy), e.g. here eachindex isn't the problem so it shouldn't appear in the error or the stack trace. (This is also why a common pattern when implementing custom array types is to use @boundschecks checkbounds(...); @inbounds ... so that the error is thrown by the custom array method rather than by the methods it calls internally.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, it's also possible and nice for users to print the indices in the error message you throw.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because dimension checking is not in an @boundscheck This would result in redundant dimension checking when the dimensions do match (the common case).

If we think about dimension checking (and bounds checking) as algorithmic operations that have a performance impact and should be performed only once for both style and performance reasons, then this is inelegant. This is how I currently think about bounds and dimension checking.

If we think about bounds and dimension checking like type assertions, a fun, effective, free, and ultimately optional way of formally commenting on code to help readers, error messages, and occasionally performance, then throwing in an extra dimension check is a good idea. This is how I would like to think about bounds and dimension checking, but I don't know if the compiler can elide redundant checks and the syntax for additional bounds and dimension checking is a lot less clean than ::Integer or a hypothetical ::InBounds.

I've only heard of this hoisting in @vchuravy's comment JuliaLang/julia#42521 (comment), and don't know where to look for details, feasibility & potential timelines, but it sounds enticing. It would be nice to obviate this whole discussion with a type error at the top of the call hierarchy.

This being a real package that people actually use, and dimension checking being cheap in practice (I think), makes me think that your suggestion of keeping the original errors (probably extended to follow eachindex's style of showing input values) is the best way to go pragmatically, even if there is theoretically a better way on the horizon or this way is technically suboptimal for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've put back eager dimension checking and added error messages.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler is generally good at removing redundant checks, but it needs to be able to prove that they are redundant.

JuliaLang/julia#42573 adds the ability to hoist @inbounds in LLVM, and JuliaLang/julia#42692 is about making that happen more often. All this will only come down the pipe in 1.8, so manual hoisting is still the name of the game for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Fun and exciting! I can't wait till the day I can stop writing @inbounds in my code altogether (or at least as much).

src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
@LilithHafner LilithHafner changed the title support offset arrays and simplify _addcounts_radix_sort_loop! make counting more robust to input datatype Oct 11, 2021
Project.toml Outdated Show resolved Hide resolved
Project.toml Outdated Show resolved Hide resolved
Project.toml Outdated Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated
proportion in `x`.
Return a dictionary mapping each unique value in `x` to its proportion in `x`.

When `x` is a vector, a vector of weights `wv` can be provided and the sum of the weights
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When `x` is a vector, a vector of weights `wv` can be provided and the sum of the weights
When `x` is a vector, a vector of weights `wv` can be provided and the proportion of the weights

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is fixed

test/counts.jl Outdated Show resolved Hide resolved
test/counts.jl Outdated Show resolved Hide resolved
test/counts.jl Outdated Show resolved Hide resolved
test/counts.jl Outdated
Comment on lines 196 to 198
@test (countmap(x) == countmap(x; alg = :dict) == countmap(x; alg = :radixsort)
== countmap(y) == countmap(y; alg = :dict) == countmap(y; alg = :radixsort)
== countmap(z) == countmap(z; alg = :dict) == countmap(z; alg = :radixsort))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@test (countmap(x) == countmap(x; alg = :dict) == countmap(x; alg = :radixsort)
== countmap(y) == countmap(y; alg = :dict) == countmap(y; alg = :radixsort)
== countmap(z) == countmap(z; alg = :dict) == countmap(z; alg = :radixsort))
@test countmap(x) == countmap(x; alg = :dict) == countmap(x; alg = :radixsort) ==
countmap(y) == countmap(y; alg = :dict) == countmap(y; alg = :radixsort) ==
countmap(z) == countmap(z; alg = :dict) == countmap(z; alg = :radixsort)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the == but kept the parentheses because a newline mistake would have the impact of silently skipping the test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not possible. The worst outcome that could happen would be to get a syntax error. Please follow the style used elsewhere in the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking about

using Test
@test 1 == 1 ==
      1 == 1 
      2 == 1

when I say a newline mistake would silently skip tests.

I don't think any other expressions in the file span multiple lines. By "follow the style used elsewhere in the file" do you mean I should break this into multiple expressions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just suggest dropping the parentheses. We can't protect against any mistakes people might make in the future. Otherwise we would add to put parentheses even when multiple checks are on a single line just in case somebody adds a line break in the wrong place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really prefer the parentheses. ...But I don't care enough to find & cite or conduct statistical research about the frequency of silent line break bugs in Julia or other languages, nor to dig back up the paper I read a while ago which claimed operators at the start of a line are more readable, so we can just do it your way. ¯_(ツ)_/¯

LilithHafner and others added 2 commits October 14, 2021 16:18
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Contributor

bkamins commented May 18, 2022

x-ref: #790 (comment) (as I think we need to make a general design decision about how we want to handle non-1-based indexing that next is consistently implemented in the package)

@LilithHafner
Copy link
Contributor Author

I'd prefer throwing errors when passing arrays with mismatched indices...

Agreed. Ideally, any time two arrays (including array and weights) are supposed to line up, we should call eachindex(a, b).

An OffsetArray with nonzero offset is not compatible with an Array. A matrix is not compatible with a vector with the same number of elements.

...but there are some cases where we silently treated weight arrays as vectors whatever their dimensionality. So we have to preserve these, but in other cases we can be strict.

Sadly, yes.

Thinking about this again, maybe we should deprecate this behavior

Yay! That sounds great to me! Some options include

  1. Restrict weighted addcounts! to AbstractVectors (your suggestion above reasonable)
  2. Extend AbstractWeights to an AbstractArray but still add a depreciated fallback for backward compatibility (I don't know how weights tend to get used, but if they are often treated in parallel with AbstractArrays, then this makes sense)

I think we need to make a general design decision about how we want to handle non-1-based indexing that next is consistently implemented in the package.

I think the default choice for this decision is made language wide with eachindex. The issue is that this package already supports some things that eachindex does not.

@LilithHafner
Copy link
Contributor Author

I propose that any time two arrays (including array and weights) are supposed to line up, we call eachindex(a, b). And we add @depricated methods to preserve backward compatibility.

@nalimilan
Copy link
Member

Is there anything left to decide here?

@LilithHafner
Copy link
Contributor Author

LilithHafner commented Jun 2, 2022

I believe everything is resolved except for deprecating mismatched indexes that were previously supported.
I think we can leave that to a separate PR. Here is an Issue to track it.

@LilithHafner
Copy link
Contributor Author

Bump, I'd love to get this merged once folks are happy with it. It hits a lot of correctness issues and is already 8 months old.

@bkamins
Copy link
Contributor

bkamins commented Jun 6, 2022

I am not a StatsBase.jl maintainer, but most likely maintainers are waiting for CI to pass before doing a final review.
What are the reasons of bouds errors for OffsetArrays.jl on nightly?

@LilithHafner
Copy link
Contributor Author

Oh, I wasn't paying attention to those failed tests because they also fail on StatsBase master. I'll look into it.

@bkamins
Copy link
Contributor

bkamins commented Jun 6, 2022

Ah - if this is unrelated then the failing tests should of course be fixed in a separate PR.

@LilithHafner
Copy link
Contributor Author

LilithHafner commented Jun 6, 2022

There are unrelated failures and also failures due to tests added in this PR. This PR tests counting with OffsetArrays (though only on v1.9+) and counting uses SortingAlgorihtms.RadixSort which fails on some OffsetArrays. One solution would be to use Base.Sort.DEFAULT_UNSTABLE which should be reliable* and as of 1.9 is also as fast as SortingAlgorihtms.RadixSort and much faster in many common special cases (e.g. inputs shorter than 1000 elements).

We could switch from SortingAlgorithms.RadixSort to Base.Sorrt.DEFAULT_UNSTABLE for all Julia versions (+ correctness; + performance for collections < 500-1000 elements; - performance for large collections on julia < v1.9) or just for v1.9+. (see #796 for details)

*Coincidentally, Base.Sort.sort! is also broken for some OffsetArrays right now.

@nalimilan
Copy link
Member

Ah yes it would be good to move to the new radix sort implementation in Base on Julia >= 1.9. But I'd keep SortingAlgorithms on older releases to avoid regressions (1.9 isn't even close to be released). We can just call Base.require_on_based_indexing on Julia < 1.9.

@LilithHafner
Copy link
Contributor Author

I did that and also had to limit OffsetArrays tests to 1.9+ rather than 1.6+ because they now give an "addcounts_radixsort! requires either one based indexing or Julia 1.9. Use alg = :dict as an alternative." error. SortingAlgorithms.RadixSort only sometimes segfaults and the StatsBase teest cases don't hit those cases which is why they previously passed on 1.6.

@LilithHafner
Copy link
Contributor Author

CI Tests now fail in the same way in this PR that they do in master. (something about random numbers)

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks. Sorry, I have one more request, as new code should really be tested.

src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
src/counts.jl Outdated Show resolved Hide resolved
src/counts.jl Outdated Show resolved Hide resolved
@nalimilan nalimilan merged commit f0cccd6 into JuliaStats:master Jun 8, 2022
@nalimilan
Copy link
Member

Thanks!

@LilithHafner
Copy link
Contributor Author

Wow! It actually happened! At 8 months and 8 participants, this is certainly the largest PR I've been a substantial part of. It builds my confidence in these systems of voluntary open-source code review to see that even if something takes this long, it still merges eventually if it's technically sound and provides a contribution that justifies the effort.

Thank you to everyone who helped see this through!

@nalimilan
Copy link
Member

Glad you enjoyed it! :-p

@LilithHafner
Copy link
Contributor Author

Now I know what you meant when you said

Thanks. Unfortunately I'm afraid the rabbit hole is even deeper. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stop using SortingAlgorithms.RadixSort in count.jl _addcounts_radix_sort_loop! unsafe for OffsetVectors
8 participants