make counting more robust to input datatype #722

LilithHafner · 2021-10-02T18:55:02Z

Fixes #721
Fixes #796

nalimilan

Thanks! Could you add tests using OffsetArrays to ensure it works and we don't introduce regressions in the future? While you're at it, it would be nice to test the dict-based method too if it works.


LilithHafner · 2021-10-07T01:31:55Z

Adding tests turned into a rabbit hole because I identified a few other areas where OffsetArrays lead to unsafe memory access, but I think I've got it all now, or at least a reasonably self contained chunk.

nalimilan

Thanks. Unfortunately I'm afraid the rabbit hole is even deeper. :-)


nalimilan · 2021-10-11T11:53:20Z

src/counts.jl

@@ -43,13 +43,12 @@ function addcounts!(r::AbstractArray, x::IntegerArray, levels::IntUnitRange, wv:
    # add wv weighted counts of integers from x that fall within levels to r

    @boundscheck checkbounds(r, axes(levels)...)
-    @boundscheck axes(x) == axes(wv) || throw(DimensionMismatch("x and wv must have the same axes"))


I think these are worth keeping (here and elsewhere) as they mention the names of the arguments whose dimensions don't match, which eachindex cannot do. Also in this particular case you've kept @inbounds so no check is done AFAICT.

I'm still thinking about this. I think the current error will look like this:

julia> countmap([1,2], [3,4,5])
ERROR: DimensionMismatch("all inputs to eachindex must have the same indices, got Base.OneTo(2) and Base.OneTo(3)")
Stacktrace:
 [1] throw_eachindex_mismatch_indices(::IndexLinear, ::Base.OneTo{Int64}, ::Vararg{Base.OneTo{Int64}, N} where N)
   @ Base ./abstractarray.jl:260
 [2] eachindex
   @ ./abstractarray.jl:316 [inlined]
 [3] eachindex
   @ ./abstractarray.jl:305 [inlined]
 [4] addcounts!(cm::Dict{Int64, Int64}, x::Vector{Int64}, wv::Vector{Int64})
   @ StatsBase ~/.julia/dev/StatsBase/src/counts.jl:382
 [5] countmap(x::Vector{Int64}, wv::Vector{Int64})
   @ StatsBase ~/.julia/dev/StatsBase/src/counts.jl:415
 [6] top-level scope
   @ none:1

julia> @inbounds eachindex([1,2], [3,4,5])
ERROR: DimensionMismatch("all inputs to eachindex must have the same indices, got Base.OneTo(2) and Base.OneTo(3)")
Stacktrace:
 [1] throw_eachindex_mismatch_indices(::IndexLinear, ::Base.OneTo{Int64}, ::Vararg{Base.OneTo{Int64}, N} where N)
   @ Base ./abstractarray.jl:260
 [2] eachindex
   @ ./abstractarray.jl:316 [inlined]
 [3] eachindex(A::Vector{Int64}, B::Vector{Int64})
   @ Base ./abstractarray.jl:305
 [4] top-level scope
   @ none:1

If someone is calling countmap directly in their code with axes(x) ≠ axes(wv), then the message "x and wv must have the same axes" is more helpful. If, on the other hand, the code causing the error is more removed (e.g.

julia> using Something

help?> f
  f(a, b) does something and calls countmap(a,b) internally.

julia> f([1,2],[3,4,5])
error...

) then "all inputs to eachindex must have the same indices, got Base.OneTo(2) and Base.OneTo(3)" might be easier to understand because it lists the indices. The question is then if we want to manually lift the error message a level to update the parameter names, to which I lean toward no.

It depends on how countmap is used.

I think in general it's always better to throw the error as soon as possible rather than letting it be thrown by another function deeper in the call stack. Otherwise it's hard to understand what happened (and you may even think StatsBase is buggy), e.g. here eachindex isn't the problem so it shouldn't appear in the error or the stack trace. (This is also why a common pattern when implementing custom array types is to use @boundscheck checkbounds(...); @inbounds ... so that the error is thrown by the custom array method rather than by the methods it calls internally.)
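As a minimal sketch of the pattern mentioned in the parenthetical above: the `Doubled` wrapper type is hypothetical and exists only for illustration. It checks the index against the wrapper itself, so an out-of-range access reports the wrapper rather than the array it holds.

```julia
# Hypothetical wrapper type, used only to illustrate the pattern.
struct Doubled{T<:Number} <: AbstractVector{T}
    parent::Vector{T}
end

Base.size(d::Doubled) = size(d.parent)

@inline function Base.getindex(d::Doubled, i::Int)
    # Check against the wrapper so a BoundsError refers to the `Doubled` array.
    @boundscheck checkbounds(d, i)
    # The index was validated above, so the parent's own check can be skipped.
    v = @inbounds d.parent[i]
    return v + v
end

d = Doubled([1, 2, 3])
d[2]   # doubles the stored value 2
```

Indexing `d[4]` here raises a `BoundsError` whose array field is the `Doubled` instance, not the inner `Vector`.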

That said, it's also possible and nice for users to print the indices in the error message you throw.
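A sketch of what such an eager check could look like, assuming a hypothetical function `weighted_tally` (this is not StatsBase's code, and the message wording is illustrative). It names the arguments and also lists the mismatched axes, in the spirit of the eachindex message:

```julia
# Hypothetical function; the name and message wording are illustrative only.
function weighted_tally(x::AbstractArray, wv::AbstractArray{W}) where {W<:Real}
    # Fail early, naming the arguments and listing the offending axes.
    axes(x) == axes(wv) || throw(DimensionMismatch(
        "x and wv must have the same axes, got $(axes(x)) and $(axes(wv))"))
    tally = Dict{eltype(x), W}()
    for i in eachindex(x, wv)
        tally[x[i]] = get(tally, x[i], zero(W)) + wv[i]
    end
    return tally
end

weighted_tally([1, 2, 1], [0.5, 1.0, 2.0])   # value 1 accumulates 0.5 + 2.0
```

With mismatched inputs such as `weighted_tally([1, 2], [3, 4, 5])`, the error is raised by the function the user called, before any loop runs.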

Because dimension checking is not in an @boundscheck, this would result in redundant dimension checking when the dimensions do match (the common case).

If we think about dimension checking (and bounds checking) as algorithmic operations that have a performance impact and should be performed only once for both style and performance reasons, then this is inelegant. This is how I currently think about bounds and dimension checking.

If we think about bounds and dimension checking like type assertions, a fun, effective, free, and ultimately optional way of formally commenting on code to help readers, error messages, and occasionally performance, then throwing in an extra dimension check is a good idea. This is how I would like to think about bounds and dimension checking, but I don't know if the compiler can elide redundant checks and the syntax for additional bounds and dimension checking is a lot less clean than ::Integer or a hypothetical ::InBounds.

I've only heard of this hoisting in @vchuravy's comment JuliaLang/julia#42521 (comment), and don't know where to look for details, feasibility & potential timelines, but it sounds enticing. It would be nice to obviate this whole discussion with a type error at the top of the call hierarchy.

This being a real package that people actually use, and dimension checking being cheap in practice (I think), makes me think that your suggestion of keeping the original errors (probably extended to follow eachindex's style of showing input values) is the best way to go pragmatically, even if there is theoretically a better way on the horizon or this way is technically suboptimal for now.

I've put back eager dimension checking and added error messages.

The compiler is generally good at removing redundant checks, but it needs to be able to prove that they are redundant.

JuliaLang/julia#42573 adds the ability to hoist @inbounds in LLVM, and JuliaLang/julia#42692 is about making that happen more often. All this will only come down the pipe in 1.8, so manual hoisting is still the name of the game for now.
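For illustration, manual hoisting can be sketched as one explicit check before the loop followed by an `@inbounds` loop body. The function `sum_range` is a hypothetical example, not code from this PR:

```julia
# Hypothetical example of hoisting a bounds check out of a loop by hand.
function sum_range(v::AbstractVector{T}, r::AbstractUnitRange{<:Integer}) where {T<:Number}
    checkbounds(v, r)        # one check covering the whole range
    s = zero(T)
    @inbounds for i in r     # safe: every i in r was validated above
        s += v[i]
    end
    return s
end

sum_range([10, 20, 30, 40], 2:3)   # adds the second and third elements
```

The cost of the check is paid once per call instead of once per iteration, and an invalid range such as `2:5` on a four-element vector still fails with a `BoundsError`.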

Thanks! Fun and exciting! I can't wait till the day I can stop writing @inbounds in my code altogether (or at least as much).


nalimilan · 2021-10-14T15:42:48Z

src/counts.jl

-proportion in `x`.
+Return a dictionary mapping each unique value in `x` to its proportion in `x`.
+
+When `x` is a vector, a vector of weights `wv` can be provided and the sum of the weights


Suggested change

-When `x` is a vector, a vector of weights `wv` can be provided and the sum of the weights
+When `x` is a vector, a vector of weights `wv` can be provided and the proportion of the weights

I believe this is fixed


nalimilan · 2021-10-14T19:05:41Z

test/counts.jl

+    @test (countmap(x) == countmap(x; alg = :dict) == countmap(x; alg = :radixsort)
+        == countmap(y) == countmap(y; alg = :dict) == countmap(y; alg = :radixsort)
+        == countmap(z) == countmap(z; alg = :dict) == countmap(z; alg = :radixsort))


Suggested change

-    @test (countmap(x) == countmap(x; alg = :dict) == countmap(x; alg = :radixsort)
-        == countmap(y) == countmap(y; alg = :dict) == countmap(y; alg = :radixsort)
-        == countmap(z) == countmap(z; alg = :dict) == countmap(z; alg = :radixsort))
+    @test countmap(x) == countmap(x; alg = :dict) == countmap(x; alg = :radixsort) ==
+        countmap(y) == countmap(y; alg = :dict) == countmap(y; alg = :radixsort) ==
+        countmap(z) == countmap(z; alg = :dict) == countmap(z; alg = :radixsort)

I moved the == but kept the parentheses because a newline mistake would have the impact of silently skipping the test.

That's not possible. The worst outcome that could happen would be to get a syntax error. Please follow the style used elsewhere in the file.

I'm thinking about

using Test
@test 1 == 1 == 1 == 1
2 == 1

when I say a newline mistake would silently skip tests.
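The concern can be checked by parsing the two layouts as text and counting the top-level expressions each produces. This is only a sketch: `count_exprs` is a hypothetical helper, and it assumes a Julia version that provides `Meta.parseall`.

```julia
using Test

# A trailing `==` tells the parser that the expression continues on the next line.
@test 1 == 1 ==
      1

# Hypothetical helper: number of top-level expressions in a piece of source text.
count_exprs(src) = count(e -> !(e isa LineNumberNode), Meta.parseall(src).args)

# Parentheses keep a multi-line comparison together as one expression.
count_exprs("@test (1 == 1\n    == 1)")

# Without a trailing operator the second line is its own statement,
# evaluated and discarded rather than tested.
count_exprs("@test 1 == 1 == 1 == 1\n2 == 1")
```

The first call yields one expression and the second yields two, which is the situation described above: `2 == 1` runs, evaluates to `false`, and no test records it.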

I don't think any other expressions in the file span multiple lines. By "follow the style used elsewhere in the file" do you mean I should break this into multiple expressions?

I just suggest dropping the parentheses. We can't protect against any mistakes people might make in the future. Otherwise we would have to put parentheses even when multiple checks are on a single line just in case somebody adds a line break in the wrong place.

I really prefer the parentheses. ...But I don't care enough to find & cite or conduct statistical research about the frequency of silent line break bugs in Julia or other languages, nor to dig back up the paper I read a while ago which claimed operators at the start of a line are more readable, so we can just do it your way. ¯_(ツ)_/¯


bkamins · 2022-05-18T08:18:09Z

x-ref: #790 (comment) (as I think we need to make a general design decision about how we want to handle non-1-based indexing, which is then consistently implemented in the package)

LilithHafner · 2022-05-18T14:04:53Z

I'd prefer throwing errors when passing arrays with mismatched indices...

Agreed. Ideally, any time two arrays (including array and weights) are supposed to line up, we should call eachindex(a, b).

An OffsetArray with nonzero offset is not compatible with an Array. A matrix is not compatible with a vector with the same number of elements.
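One way to read "line up" is equality of axes rather than equality of element counts. The sketch below uses a hypothetical helper `same_axes` to show the difference for a vector and a matrix of the same length; it makes no claim about how strictly `eachindex(a, b)` itself treats arrays of different dimensionality.

```julia
# Hypothetical helper: two arrays line up when their axes are identical.
same_axes(a::AbstractArray, b::AbstractArray) = axes(a) == axes(b)

v = [1, 2, 3, 4]
m = [1 2; 3 4]
w = [5, 6, 7, 8]

length(v) == length(m)   # same number of elements
same_axes(v, m)          # false: one axis of length 4 versus two axes of length 2
same_axes(v, w)          # true
eachindex(v, w)          # shared index range for iterating v and w in lockstep
```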

...but there are some cases where we silently treated weight arrays as vectors whatever their dimensionality. So we have to preserve these, but in other cases we can be strict.

Sadly, yes.

Thinking about this again, maybe we should deprecate this behavior

Yay! That sounds great to me! Some options include

Restrict weighted addcounts! to AbstractVectors (your suggestion above seems reasonable)
Extend AbstractWeights to an AbstractArray but still add a deprecated fallback for backward compatibility (I don't know how weights tend to get used, but if they are often treated in parallel with AbstractArrays, then this makes sense)

I think we need to make a general design decision about how we want to handle non-1-based indexing, which is then consistently implemented in the package.

I think the default choice for this decision is made language wide with eachindex. The issue is that this package already supports some things that eachindex does not.

LilithHafner · 2022-05-18T14:06:11Z

I propose that any time two arrays (including array and weights) are supposed to line up, we call eachindex(a, b). And we add deprecated methods (via @deprecate) to preserve backward compatibility.


nalimilan · 2022-06-02T21:08:48Z

Is there anything left to decide here?

LilithHafner · 2022-06-02T21:11:17Z

I believe everything is resolved except for deprecating mismatched indexes that were previously supported.
I think we can leave that to a separate PR. Here is an Issue to track it.

LilithHafner · 2022-06-06T11:01:48Z

Bump, I'd love to get this merged once folks are happy with it. It hits a lot of correctness issues and is already 8 months old.

bkamins · 2022-06-06T11:40:04Z

I am not a StatsBase.jl maintainer, but most likely maintainers are waiting for CI to pass before doing a final review.
What is causing the bounds errors for OffsetArrays.jl on nightly?

LilithHafner · 2022-06-06T12:06:22Z

Oh, I wasn't paying attention to those failed tests because they also fail on StatsBase master. I'll look into it.

bkamins · 2022-06-06T12:10:32Z

Ah - if this is unrelated then the failing tests should of course be fixed in a separate PR.

LilithHafner · 2022-06-06T12:32:28Z

There are unrelated failures and also failures due to tests added in this PR. This PR tests counting with OffsetArrays (though only on v1.9+) and counting uses SortingAlgorithms.RadixSort, which fails on some OffsetArrays. One solution would be to use Base.Sort.DEFAULT_UNSTABLE, which should be reliable* and as of 1.9 is also as fast as SortingAlgorithms.RadixSort and much faster in many common special cases (e.g. inputs shorter than 1000 elements).

We could switch from SortingAlgorithms.RadixSort to Base.Sort.DEFAULT_UNSTABLE for all Julia versions (+ correctness; + performance for collections < 500-1000 elements; - performance for large collections on julia < v1.9) or just for v1.9+. (see #796 for details)

*Coincidentally, Base.Sort.sort! is also broken for some OffsetArrays right now.

nalimilan · 2022-06-06T14:27:36Z

Ah yes it would be good to move to the new radix sort implementation in Base on Julia >= 1.9. But I'd keep SortingAlgorithms on older releases to avoid regressions (1.9 isn't even close to being released). We can just call Base.require_one_based_indexing on Julia < 1.9.
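A rough sketch of such a version gate follows. The helper name `sorted_for_counting` is hypothetical, and to keep the example free of dependencies it uses Base's default unstable sort on every version, whereas the suggestion above keeps SortingAlgorithms' radix sort on older releases.

```julia
# Hypothetical helper illustrating the version gate; not StatsBase's code.
function sorted_for_counting(x::AbstractArray)
    if VERSION < v"1.9.0-DEV"
        # Older releases: reject offset axes up front instead of risking
        # out-of-bounds access inside the sort.
        Base.require_one_based_indexing(x)
    end
    # Assumption: Base's default unstable algorithm stands in for radix sort here.
    return sort(vec(x); alg = Base.Sort.DEFAULT_UNSTABLE)
end

sorted_for_counting([3 1; 2 0])   # flattens the matrix, then sorts the values
```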


LilithHafner · 2022-06-06T15:54:13Z

I did that and also had to limit OffsetArrays tests to 1.9+ rather than 1.6+ because they now give an "addcounts_radixsort! requires either one based indexing or Julia 1.9. Use alg = :dict as an alternative." error. SortingAlgorithms.RadixSort only sometimes segfaults, and the StatsBase test cases don't hit those cases, which is why they previously passed on 1.6.

LilithHafner · 2022-06-06T16:05:44Z

CI Tests now fail in the same way in this PR that they do in master. (something about random numbers)

nalimilan

OK, thanks. Sorry, I have one more request, as new code should really be tested.


nalimilan · 2022-06-08T07:39:49Z

Thanks!

LilithHafner · 2022-06-08T11:46:47Z

Wow! It actually happened! At 8 months and 8 participants, this is certainly the largest PR I've been a substantial part of. It builds my confidence in these systems of voluntary open-source code review to see that even if something takes this long, it still merges eventually if it's technically sound and provides a contribution that justifies the effort.

Thank you to everyone who helped see this through!

nalimilan · 2022-06-08T11:55:17Z

Glad you enjoyed it! :-p

LilithHafner · 2022-06-08T12:14:29Z

Now I know what you meant when you said

Thanks. Unfortunately I'm afraid the rabbit hole is even deeper. :-)

support offset arrays and simplify _addcounts_radix_sort_loop

3905f6e

nalimilan reviewed Oct 5, 2021

add tests and fix more occurances of unsupported offset arrays [ref](https://docs.julialang.org/en/v1/devdocs/offset-arrays/#Things-to-watch-out-for)

8eb5855

nalimilan reviewed Oct 10, 2021

Apply suggestion from code review

84ef887

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

LilithHafner commented Oct 10, 2021

LilithHafner and others added 3 commits October 10, 2021 18:05

Replace dimension checking with varargs eachindex

d82295d

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

Test addcounts! on row vector

a4505b2

Support multidimensional arrays for :radixsort

c334d4d

LilithHafner commented Oct 10, 2021

nalimilan reviewed Oct 11, 2021

mschauer reviewed Oct 11, 2021

LilithHafner added 2 commits October 11, 2021 09:27

fix lastindex _addcounts_radix_sort_loop! indexing

94b3e6c

organize and extend testing; revert to flattening x when passed weights for compatability.

87074b2

LilithHafner commented Oct 11, 2021

LilithHafner and others added 3 commits October 11, 2021 09:59

removed todo list

aa21bc9

Update src/counts.jl

309586f

test :radixsort multidimensional array

3b5e01d

LilithHafner changed the title ~~support offset arrays and simplify _addcounts_radix_sort_loop!~~ make counting more robust to input datatype Oct 11, 2021

LilithHafner and others added 5 commits October 11, 2021 16:46

docstrings: homogonize, correct, explain proportionmap, shrink oneline descriptions

3384f45

whitespace

6b81b51

test axes(::Weights)

18b70b5

Merge branch 'JuliaStats:master' into fix-offset-array

7aa2d79

fix tests

93caf72

LilithHafner commented Oct 14, 2021

nalimilan reviewed Oct 14, 2021

LilithHafner and others added 2 commits October 14, 2021 16:18

Apply suggestions from code review

62c5967

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

put back dimension-mismatch (add messages)

1238a2d

LilithHafner and others added 2 commits May 19, 2022 12:54

Update src/counts.jl

fbdf564

use firstindex instead of 1 for robustness to future type signature expansion

e1fab99

Switch from SortingAlgorithms to Base's radix sort in Julia 1.9+ (closes JuliaStats#796)

717c795

nalimilan reviewed Jun 6, 2022

LilithHafner mentioned this pull request Jun 6, 2022

Low priority feature request: support proportionmap(x::AbstractArray; alg) #797

Open

LilithHafner commented Jun 6, 2022

Apply suggestions from code review

ba59cc3

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

LilithHafner commented Jun 6, 2022

Fix typo

2dfba9f

LilithHafner commented Jun 6, 2022

LilithHafner and others added 2 commits June 6, 2022 17:06

Style

8f446ee

Minor fixes

e20e28c

nalimilan merged commit f0cccd6 into JuliaStats:master Jun 8, 2022
