-
Notifications
You must be signed in to change notification settings - Fork 193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make counting more robust to input datatype #722
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Could you add tests using OffsetArrays to ensure it works and we don't introduce regressions in the future? While you're at it, it would be nice to test the dict-based method too if it works.
Adding tests turned into a rabbit hole because I identified a few other areas where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Unfortunately I'm afraid the rabbit hole is even deeper. :-)
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
src/counts.jl
Outdated
@@ -43,13 +43,12 @@ function addcounts!(r::AbstractArray, x::IntegerArray, levels::IntUnitRange, wv: | |||
# add wv weighted counts of integers from x that fall within levels to r | |||
|
|||
@boundscheck checkbounds(r, axes(levels)...) | |||
@boundscheck axes(x) == axes(wv) || throw(DimensionMismatch("x and wv must have the same axes")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are worth keeping (here and elsewhere) as they mention the names of the arguments whose dimensions don't match, which eachindex
cannot do. Also in this particular case you've kept @inbounds
so no check is done AFAICT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still thinking about this. I think the current error will look like this:
julia> countmap([1,2], [3,4,5])
ERROR: DimensionMismatch("all inputs to eachindex must have the same indices, got Base.OneTo(2) and Base.OneTo(3)")
Stacktrace:
[1] throw_eachindex_mismatch_indices(::IndexLinear, ::Base.OneTo{Int64}, ::Vararg{Base.OneTo{Int64}, N} where N)
@ Base ./abstractarray.jl:260
[2] eachindex
@ ./abstractarray.jl:316 [inlined]
[3] eachindex
@ ./abstractarray.jl:305 [inlined]
[4] addcounts!(cm::Dict{Int64, Int64}, x::Vector{Int64}, wv::Vector{Int64})
@ StatsBase ~/.julia/dev/StatsBase/src/counts.jl:382
[5] countmap(x::Vector{Int64}, wv::Vector{Int64})
@ StatsBase ~/.julia/dev/StatsBase/src/counts.jl:415
[6] top-level scope
@ none:1
julia> @inbounds eachindex([1,2], [3,4,5])
ERROR: DimensionMismatch("all inputs to eachindex must have the same indices, got Base.OneTo(2) and Base.OneTo(3)")
Stacktrace:
[1] throw_eachindex_mismatch_indices(::IndexLinear, ::Base.OneTo{Int64}, ::Vararg{Base.OneTo{Int64}, N} where N)
@ Base ./abstractarray.jl:260
[2] eachindex
@ ./abstractarray.jl:316 [inlined]
[3] eachindex(A::Vector{Int64}, B::Vector{Int64})
@ Base ./abstractarray.jl:305
[4] top-level scope
@ none:1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If someone is calling countmap
directly in their code with axes(x) ≠ axes(wv)
, then the message "x and wv must have the same axes" is more helpful. If, on the other hand, the code causing the error is more removed (e.g.
julia> using Something
help?> f
f(a, b)
does something and calls countmap(a,b) internally.
julia> f([1,2],[3,4,5])
error...
) then "all inputs to eachindex must have the same indices, got Base.OneTo(2) and Base.OneTo(3)" might be easier to understand because it lists the indices. The question is then if we want to manually lift the error message a level to update the parameter names, to which I lean toward no.
It depends on how countmap
is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in general it's always better to throw the error as soon as possible rather than letting it be thrown by another function deeper in the call stack. Otherwise it's hard to understand what happened (and you may even think StatsBase is buggy), e.g. here eachindex
isn't the problem so it shouldn't appear in the error or the stack trace. (This is also why a common pattern when implementing custom array types is to use @boundschecks checkbounds(...); @inbounds ...
so that the error is thrown by the custom array method rather than by the methods it calls internally.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, it's also possible and nice for users to print the indices in the error message you throw.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because dimension checking is not in an @boundscheck
This would result in redundant dimension checking when the dimensions do match (the common case).
If we think about dimension checking (and bounds checking) as algorithmic operations that have a performance impact and should be performed only once for both style and performance reasons, then this is inelegant. This is how I currently think about bounds and dimension checking.
If we think about bounds and dimension checking like type assertions, a fun, effective, free, and ultimately optional way of formally commenting on code to help readers, error messages, and occasionally performance, then throwing in an extra dimension check is a good idea. This is how I would like to think about bounds and dimension checking, but I don't know if the compiler can elide redundant checks and the syntax for additional bounds and dimension checking is a lot less clean than ::Integer
or a hypothetical ::InBounds
.
I've only heard of this hoisting in @vchuravy's comment JuliaLang/julia#42521 (comment), and don't know where to look for details, feasibility & potential timelines, but it sounds enticing. It would be nice to obviate this whole discussion with a type error at the top of the call hierarchy.
This being a real package that people actually use, and dimension checking being cheap in practice (I think), makes me think that your suggestion of keeping the original errors (probably extended to follow eachindex
's style of showing input values) is the best way to go pragmatically, even if there is theoretically a better way on the horizon or this way is technically suboptimal for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've put back eager dimension checking and added error messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compiler is generally good at removing redundant checks, but it needs to be able to prove that they are redundant.
JuliaLang/julia#42573 adds the ability to hoist @inbounds
in LLVM, and JuliaLang/julia#42692 is about making that happen more often. All this will only come down the pipe in 1.8, so manual hoisting is still the name of the game for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Fun and exciting! I can't wait till the day I can stop writing @inbounds in my code altogether (or at least as much).
src/counts.jl
Outdated
proportion in `x`. | ||
Return a dictionary mapping each unique value in `x` to its proportion in `x`. | ||
|
||
When `x` is a vector, a vector of weights `wv` can be provided and the sum of the weights |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When `x` is a vector, a vector of weights `wv` can be provided and the sum of the weights | |
When `x` is a vector, a vector of weights `wv` can be provided and the proportion of the weights |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is fixed
test/counts.jl
Outdated
@test (countmap(x) == countmap(x; alg = :dict) == countmap(x; alg = :radixsort) | ||
== countmap(y) == countmap(y; alg = :dict) == countmap(y; alg = :radixsort) | ||
== countmap(z) == countmap(z; alg = :dict) == countmap(z; alg = :radixsort)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@test (countmap(x) == countmap(x; alg = :dict) == countmap(x; alg = :radixsort) | |
== countmap(y) == countmap(y; alg = :dict) == countmap(y; alg = :radixsort) | |
== countmap(z) == countmap(z; alg = :dict) == countmap(z; alg = :radixsort)) | |
@test countmap(x) == countmap(x; alg = :dict) == countmap(x; alg = :radixsort) == | |
countmap(y) == countmap(y; alg = :dict) == countmap(y; alg = :radixsort) == | |
countmap(z) == countmap(z; alg = :dict) == countmap(z; alg = :radixsort) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the ==
but kept the parentheses because a newline mistake would have the impact of silently skipping the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not possible. The worst outcome that could happen would be to get a syntax error. Please follow the style used elsewhere in the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking about
using Test
@test 1 == 1 ==
1 == 1
2 == 1
when I say a newline mistake would silently skip tests.
I don't think any other expressions in the file span multiple lines. By "follow the style used elsewhere in the file" do you mean I should break this into multiple expressions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just suggest dropping the parentheses. We can't protect against any mistakes people might make in the future. Otherwise we would add to put parentheses even when multiple checks are on a single line just in case somebody adds a line break in the wrong place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really prefer the parentheses. ...But I don't care enough to find & cite or conduct statistical research about the frequency of silent line break bugs in Julia or other languages, nor to dig back up the paper I read a while ago which claimed operators at the start of a line are more readable, so we can just do it your way. ¯_(ツ)_/¯
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
x-ref: #790 (comment) (as I think we need to make a general design decision about how we want to handle non-1-based indexing that next is consistently implemented in the package) |
Agreed. Ideally, any time two arrays (including array and weights) are supposed to line up, we should call An OffsetArray with nonzero offset is not compatible with an Array. A matrix is not compatible with a vector with the same number of elements.
Sadly, yes.
Yay! That sounds great to me! Some options include
I think the default choice for this decision is made language wide with |
I propose that any time two arrays (including array and weights) are supposed to line up, we call |
Is there anything left to decide here? |
I believe everything is resolved except for deprecating mismatched indexes that were previously supported. |
Bump, I'd love to get this merged once folks are happy with it. It hits a lot of correctness issues and is already 8 months old. |
I am not a StatsBase.jl maintainer, but most likely maintainers are waiting for CI to pass before doing a final review. |
Oh, I wasn't paying attention to those failed tests because they also fail on StatsBase master. I'll look into it. |
Ah - if this is unrelated then the failing tests should of course be fixed in a separate PR. |
There are unrelated failures and also failures due to tests added in this PR. This PR tests counting with OffsetArrays (though only on v1.9+) and counting uses SortingAlgorihtms.RadixSort which fails on some OffsetArrays. One solution would be to use We could switch from *Coincidentally, Base.Sort.sort! is also broken for some OffsetArrays right now. |
Ah yes it would be good to move to the new radix sort implementation in Base on Julia >= 1.9. But I'd keep SortingAlgorithms on older releases to avoid regressions (1.9 isn't even close to be released). We can just call |
I did that and also had to limit OffsetArrays tests to 1.9+ rather than 1.6+ because they now give an "addcounts_radixsort! requires either one based indexing or Julia 1.9. Use |
CI Tests now fail in the same way in this PR that they do in master. (something about random numbers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks. Sorry, I have one more request, as new code should really be tested.
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Thanks! |
Wow! It actually happened! At 8 months and 8 participants, this is certainly the largest PR I've been a substantial part of. It builds my confidence in these systems of voluntary open-source code review to see that even if something takes this long, it still merges eventually if it's technically sound and provides a contribution that justifies the effort. Thank you to everyone who helped see this through! |
Glad you enjoyed it! :-p |
Now I know what you meant when you said
|
Fixes #721
Fixes #796