Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make pairwise multi-threaded and make corkendall and corspearman wrap pairwise. #923

Open
wants to merge 24 commits into
base: master
Choose a base branch
from

Conversation

PGS62
Copy link
Contributor

@PGS62 PGS62 commented Mar 31, 2024

This pull request follows up on #849. In that discussion I envisaged amending corkendall to be multi-threaded. Instead I have done the following:

  1. Made pairwise and pairwise! multi-threaded, and in the case of skipmissing = :pairwise, greatly reduced allocations which is needed for good multi-threaded performance. A nice illustration of the performance improvements this leads to is obtained with f = LinearAlgebra.dot. The example here shows a 31.5 times speedup on a 12 core PC.
  2. corkendall now simply wraps pairwise but with added methods of _pairwise! for typeof(corkendall) to retain the speedups to corkendall that were a result of Speeding up Kendall Correlation corkendall #634.
  3. corspearman also wraps pairwise with added methods of _pairwise! for typeof(corspearman). The purpose of these added methods is to reduce the number of calls to tiedrank from 2nm to 2(n+m) when n and m are respectively the number of columns in the arguments x and y to the call to corspearman(x,y....

Since corkendall and corspearman now wrap pairwise they have a skipmissing argument, which introduces an additional inconsistency between corkandall/corspearman and cov/cor.

I developed the new code in my package KendallTau. Please see the README file for further examples of the performance improvements achieved, which for all but the smallest input data exceed the number of cores on the PC (thanks to reduced allocations).

I have amended the tests as needed, and I believe that test coverage is as close to 100% as possible.

I apologise that there is quite a lot for maintainers to review, and I'll be happy to answer questions.

Copy link
Member

@andreasnoack andreasnoack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great. Thanks for contributing such a PR. I'm happy with it but lets keep if open for a couple of days to see if anybody else would like to comment.

test/pairwise.jl Outdated Show resolved Hide resolved
src/rankcorr.jl Outdated
Xjrank = tiedrank(Xj)
C[j,1] = cor(Xjrank, yrank)
function corspearman(x::AbstractMatrix, y::AbstractVector; skipmissing::Symbol=:none)
return corspearman(x, reshape(y, (length(y), 1)); skipmissing)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the 1.0 tests are failing because of this way of passing the keyword argument. I'd be okay with bumping the minimum version for StatsBase since 1.0 is ancient but maybe it's better to just make it skipmissing=skipmissing here for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When writing and testing the code (in KendallTau) I was unaware that support for Julia 1.0.5 was necessary, so the test failures when I submitted the PR came as a surprise. Changing how keyword arguments are handled would be easy, as would avoiding trailing backslashes within string literals. But the fact that 1.0.5 does not support eachcol is a bit more difficult. eachcol was added in 1.1. For the time being I will work on responding to @devmotion's comments and wait for your guidance as to whether bumping the minimum version might happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done some more investigation:
Other popular packages (the compat section of Project.toml) have the following minimum versions:
1.9.4 for Statistics
1.6 for Plots, DataFrames, Revise, Optim, ForwardDiff, OhMyREPL, CSV
1.3 for Distributions
1 for StatsBase
So upping the minimum version for StatsBase seems sensible.

But as I remarked above, eachcol is crucial to how corkendall and corspearman are mere wrappers to pairwise. eachcol was introduced in Julia 1.1, so I imagined that upping the minimum Julia version for StatsBase to 1.1 would suffice. But no - the code appears to require Julia 1.9.4, and as of now does not work with Julia 1.8.5. For example the method check_vectors fails because:

# Julia 1.9.4
julia> keys((eachcol([1 2;3 4]))[2])
2-element LinearIndices{1, Tuple{Base.OneTo{Int64}}}:
 1
 2
# Julia 1.8.5
julia> keys((eachcol([1 2;3 4]))[2])
ERROR: MethodError: no method matching getindex(::Base.Generator{Base.OneTo{Int64}, Base.var"#242#243"{Matrix{Int64}}}, ::Int64)
Stacktrace:
 [1] top-level scope
   @ REPL[1]:1

So the alternatives are:

  • A. We increase the minimum version to 1.9.4, which happens to be that for Statistics, but is more modern than the current LTS.
  • B. I find a workaround so that this PR requires either no increase in the minimum version number or an increase to 1.6 which is more in line with other popular packages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's now the case that increasing the minimum version to 1.6 would work. See here.

.vscode/settings.json Outdated Show resolved Hide resolved
src/pairwise.jl Outdated
Comment on lines 14 to 15
#cor and friends are special-cased.
iscor = (f in (corkendall, corspearman, cor))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit unfortunate that this means other functions can't benefit from the improvements (?) since it is not extendable. Maybe an additional argument would be better?

Copy link
Contributor Author

@PGS62 PGS62 Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I check that you've understood the purpose of the iscor variable (which might be better named diag_is_1)?

It's a Boolean that answers the question "Does x === y imply f(x,y) == 1.0, irrespective of NaN, missing and Inf entries, and that therefore there is a speedup to be had by not bothering to call f when x[i] === y[i]".

The upshot is that if f was some new correlation estimator for which the answer to the question above was "yes" but where we failed to update line 15 then we'd miss out on the "diagonal elements are near-zero-cost to calculate" but still have the benefit of multithreading. Of course, an additional argument is possible, but it makes the interface to pairwise a bit more complicated.

There's also the fact that the author of funky_cor could simply include the line x===y && return 1.0 in its definition, which remark has got me wondering whether I should do that for corspearman and corkendall...

src/pairwise.jl Outdated Show resolved Hide resolved
src/pairwise.jl Outdated Show resolved Hide resolved
src/pairwise.jl Show resolved Hide resolved
src/pairwise.jl Show resolved Hide resolved
src/pairwise.jl Show resolved Hide resolved
src/pairwise.jl Show resolved Hide resolved
src/pairwise.jl Outdated Show resolved Hide resolved
src/pairwise.jl Outdated
[26, 25, 16, 15, 6, 5]
```
"""
function equal_sum_subsets(n::Int, num_subsets::Int)::Vector{Vector{Int}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case this function is kept (see comments above), is it necessary to return a vector of vectors or could one return an iterator of vectors instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm reasonably sure that equal_sum_subsets works well in the sense of doing a good job of load balancing. The only reason it currently returns a vector of vectors is that's what I know how to do, so I would need to read up on iterators.

Copy link
Contributor Author

@PGS62 PGS62 Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit a1d9ae9 I have replaced the function equal_sum_subsets with an iterator EqualSumSubsets whose element type is another iterator TwoStepRange. It seems to work well and leads to a small reduction in allocations. But please review as this is the first time I've written code to create an iterator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, didn't work on 32 bit. Maybe I need to replace use of Int64 with Int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 951b391 fixes EqualSumSubsets on 32-bit.

@PGS62
Copy link
Contributor Author

PGS62 commented Apr 5, 2024

I've now made changes so that pairwise, corkendall and corspearman work with all tests passing on Julia 1.6 or later. I've also changed ci.yml to test against 1.6 instead of 1.0 and the compat section of Project.toml now puts the minimum Julia version as 1.6.

Would this meet with approval from the StatsBase maintainers?

@PGS62
Copy link
Contributor Author

PGS62 commented Apr 10, 2024

The failed checks after commit [46d5655] are all in the codecov step, with error:

Commit creating failed: {"detail":"Tokenless has reached GitHub rate limit. Please upload using a token: https://docs.codecov.com/docs/adding-the-codecov-token. Expected available in 1549 seconds."}

IIUC (this page) I need to wait and try again.

@PGS62
Copy link
Contributor Author

PGS62 commented May 28, 2024

It's been a while... May I ask why this PR has not been merged? Is it:
a) An oversight?
b) Concerns about upping the minimum Julia version to 1.6?
c) Something else?

@frankier
Copy link

I know people have many obligations and PRs/emails get ignored sometimes, but this is a bit of an unfortunate situation. @PGS62 was rebuffed from registering a new package, but now this PR sits languishing.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay! This is a significant refactoring so there are many things to discuss. My main remarks are (see details in inline comments):

  • I think the threading strategy can be simplified a lot. Task local storage shouldn't be needed (it's almost never the case), using @spawn is the recommended pattern and it composes with nested threaded calls contrary to @threads. Also I wonder whether we can find a simpler solution of splitting indices across tasks than EqualSumSubsets (see here).
  • I wouldn't add a skipmissing argument to corkendall and corspeaman, as this makes the code significantly more verbose and the signature inconsistent with cor. We have pairwise for that. (And anyway it's better to make separate PRs for distinct features as in my experience large PRs are more likely to become unmanageable and stall.)
  • There are lots of unrelated formatting changes, please revert these so that the diff is cleaner and easier to review. Also in the changed code the style isn't always consistent with StatsBase's (in particular, arguments should be aligned in method definitions, no empty initial lines, space after # in comments).


#cor and friends are special-cased.
diag_is_1 = (f in (corkendall, corspearman, cor))
(diag_is_1 || f == cov) && (symmetric = x === y)
Copy link
Member

@nalimilan nalimilan Oct 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(diag_is_1 || f == cov) && (symmetric = x === y)
if (diag_is_1 || f === cov) && x === y
symmetric = true
end

diag_is_1 = (f in (corkendall, corspearman, cor))
(diag_is_1 || f == cov) && (symmetric = x === y)
#cov(x) is faster than cov(x, x)
(f == cov) && (f = ((x, y) -> x === y ? cov(x) : cov(x, y)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(f == cov) && (f = ((x, y) -> x === y ? cov(x) : cov(x, y)))
if f === cov
f = (x, y) -> x === y ? cov(x) : cov(x, y)
end

end
end
symmetric && LinearAlgebra.copytri!(dest, 'L')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT this is an internal function, better copy using an explicit loop as before. Anyway that function just does that, without any particular optimization.

(f == cov) && (f = ((x, y) -> x === y ? cov(x) : cov(x, y)))

Threads.@threads for subset in EqualSumSubsets(nc, Threads.nthreads())
for j in subset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for j in subset
@inbounds for j in subset

Comment on lines +1 to +2
function _pairwise!(::Val{:none}, f, dest::AbstractMatrix{V}, x, y,
symmetric::Bool) where {V}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
function _pairwise!(::Val{:none}, f, dest::AbstractMatrix{V}, x, y,
symmetric::Bool) where {V}
function _pairwise!(::Val{:none}, f, dest::AbstractMatrix{V}, x, y,
symmetric::Bool) where {V}

src/pairwise.jl Show resolved Hide resolved
Comment on lines +759 to +761
# eachcol was added in Julia 1.1 but for Julia 1.8, keys(eachcol(x)) fails for any Matrix x
# which causes `check_vectors` to fail. Solution is to use the comprehension below.
_eachcol(x) = VERSION < v"1.9" ? [view(x, :, i) for i in axes(x, 2)] : eachcol(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I though check_vectors allowed any iterator, as the docstring for pairwise indicates. It doesn't seem to call keys on the input. Things like pairwise(cor, eachcol(rand(4, 5)), eachcol(rand(4, 5))) work on Julia 1.6.

Anyway even if it works around an error, this function doesn't fix it when users pass eachcol(...) so we need a better solution.

# _isnan required so that corkendall and corspearman have correct handling of NaNs and
# can also accept arguments with element type for which isnan is not defined but isless is
# is defined, so that rank correlation makes sense.
_isnan(x::T) where {T<:Number} = isnan(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_isnan(x::T) where {T<:Number} = isnan(x)
_isnan(x::Number) = isnan(x)

end

if missing isa T || missing isa V
sortedx, permutedy = handle_pairwise(sortedx, scratch_py; scratch_fx=scratch_fx, scratch_fy=scratch_fy)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes pairwise deletion but what about skipmissing=:listwise?

#cov(x) is faster than cov(x, x)
(f == cov) && (f = ((x, y) -> x === y ? cov(x) : cov(x, y)))

Threads.@threads for subset in EqualSumSubsets(nc, Threads.nthreads())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Continuing #923 (comment))

This approach is clever, but I also wonder whether it wouldn't be possible to achieve the same result with less code. If you identify a cell in the upper triangle using an integer index, you can just split these equally across threads. Then you can compute back what row and column the integer index corresponds to (e.g. via i, j = Tuple(CartesianIndices(dest))). If that's simpler you could even attribute indices from the full table to tasks, and skip those from the lower triangle from inside tasks. Am I missing something?

It could also make sense to spawn more tasks than Threads.nthreads when the input matrix is really large. Indeed, we don't have any guarantee that there are Threads.nthreads() idle threads. We may be operating within a context which is already threaded, and for example one thread, or two thirds of the threads may already be busy; another app may also be using cores. So when we're sure that the overhead of spawning threads is small compared to the total cost of the operation it's worth spawning lots of tasks that can then be attributed to idle threads as appropriate. This is what we do in DataFrames.jl, with a threshold so that we spawn a new task for each batch of N rows.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants