-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
missing gather/scatter #4
Comments
Correct. It's not there and not proposed yet. Note that I just posted a request for feedback on the Parallelism TS 2. You can submit feedback at https://github.com/mattkretz/std-simd-feedback/ which I will collect into a paper for (one of) the next meeting(s). Anyway, my plan is to finish the non-member subscript paper (I'd take contributions 😉). See what EWG thinks of it and then decide how to design the gather/scatter interface. However, we could start earlier, since low-level gather/scatter (i.e. without using nice subscript interfaces) will be needed as well. At this point, one needs a non-std extension to |
I can't get your papers to build here. Maybe you have a readymade copy of the relevant paper somewhere?
I felt that the standard's constructor for simd using a generator function suggests code like this: template < typename T , typename A >
struct gs_simd // class simd with gather/scatter
: public std::experimental::simd < T , A >
{
typedef std::experimental::simd < T , A > base_t ;
// perfect-forward any arguments to the base class c'tor.
template < typename ... types >
gs_simd ( types && ... args )
: base_t ( std::forward < types > ( args ) ... )
{ } ;
template < typename indexable_t ,
typename index_t >
void gather ( const indexable_t & src ,
const index_t & indexes )
{
// assign base_t object created by gen-type c'tor
*this = base_t ( [&] ( const size_t & i )
{ return src [ indexes[i] ] ; } ) ;
}
template < typename indexable_t ,
typename index_t >
void scatter ( indexable_t & trg ,
const index_t & indexes )
{
// gen-type c'tor is only used for side effects; let the compiler
// figure out that the result is unused
base_t ( [&] ( const size_t & i )
{ return trg [ indexes[i] ] = (*this)[i] ; } ) ;
}
} ; Mileage does of course vary with the compiler and ISA used, but it would be a start. It works for all simd objects, and gives a reference implementation for doublechecking. |
BTW, coding the g/s this way also encompasses 'regular' g/s (from/to strided memory), just by passing a suitable object as 'indexes':
|
I just added a link from the README to the paper: https://web-docs.gsi.de/~mkretz/DNMSO.pdf. And yes, I agree that also allowing a type like Regarding the generator ctor. Functionally, this ctor will plug any missing-feature hole. In terms of optimization (which is the reason for In terms of low-level API I'm confident with https://github.com/VcDevel/Vc/blob/1.4/Vc/common/gatherinterface.h and https://github.com/VcDevel/Vc/blob/1.4/Vc/common/scatterinterface.h to work. Though the optional The Anyway, I'll start on a paper ASAP. I'd be happy if you want to co-author. |
Oh, and of course, I need to work on an implementation to go with the paper... |
I meant it the other way round: pass strides_t<4> as 'indexes'.
It does have great conceptual appeal, though, because it keeps the interface really small and simple and burdens the search for optimizations to 'further inside'. I did a lot of playing around with self-made 'pseudo-vector' types (you can look at my code here) and for the 'basics' (like, all of vspline) the resulting code came out just about as fast with Vc as without - nowadays compilers are very clever indeed (I use clang++, which does a better job for me than g++)
Thanks for the offer. I won't have time for a couple of weeks now. I'll get in touch when I'm back 'on it'. |
That's also what I meant (or I'm still completely misunderstanding what you're saying). Basically void f(std::array<float, N>& data, simd<int> indexes) {
auto x = simd<float>::gather(data.data(), indexes); // data[indexes[{0,1,2,3,...}]]
auto y = simd<float>::gather(data.data(), std::strided<4>()); // data[{0,4,8,12,...}]
...
}
I totally agree. And I won't stop anyone from using the generator ctor for such things. That's what it's meant to be used for. I failed to mention it here, but at least as important is the syntax issue. histogram[index_vec] += 1; is nicer than simd<int>([&](auto i) {
histogram[index_vec[i]] += 1;
}); Also, the generator callable is one of those weird code regions where exceptions, I/O, and locking mutexes may be UB. (just like for |
Sorry for the delay. Now I'm available again.
I misunderstood your comment. So we agree. Yes, indexes would usually be simd or, in a more generalized way, any indexable object yielding integral types with operator[]. The general g/s formulation via the gen ctor covers all that and can serve as reference implementation. Then, by overloading the g/s methods, specific types of index_t objects - like simd - can dispatch to tweaked code, all the way down to directly using intrinsics if the vector sizes match the hardware vector size.
I agree. It's great to have, to write generic code which works for both single and simdized indexes. But this is beyond g/s; it's a full gather/modify/scatter cycle. I think you'd like to avoid making g/s explicit and use subscripting with index vectors exclusively. I like having explict g/s methods, because it makes it easier to opt out by writing a class which can stand as a replacement for std::simd with the same interface. Coding with explicit g/s is more verbose, but the implementation can be very compact. Initially, when I started using Vc, I just used Vc's API exclusively. After some time I had doubts about Vc's long-term viability and accessability and decided to become able to opt out, by building a class template to emulate the funcionality of Vc which my program was actually using, and at the same time I tried to narrow down my use of vector code to some comfortable minimum to keep the amount of code I had to emulate small. It turned out that using explicit g/s was helpful in doing that. The emulator class template uses small loops which can be picked up by the compiler's loop vectorizer. Next I added an option to use Vc, which delegates some operations to Vc. This way my code can run either way and allow users to avoid using Vc - e.g. because the version their package management offers is too old or they'd have to use Vc from source. This gave me an idea for the task at hand: since we already have Vc with all of it's well-thought-out g/s functionality, how about delegating to Vc to have some 'real' g/s code to compare with g/s via the gen c'tor? Like:
This would provide a test-bed with minimal effort, which could be used to make the case for making g/s part of the standard (if actual test code does support the notion that this is faster). If these preliminary tests suggest it's a good idea, the Vc code can be pulled into your std::simd implementation. |
A year has passed, so I thought I'd get back to this issue again. Just to remind you where I'm coming from, my library vspline requires efficient gather/scatter operations and (de)interleaving via load/shuffle/store. I currently use Vc 1.4.1 for the purpose. I'd like to eventually move to std::simd, but the standard does not address these issues. |
Again I'll post to this issue, following up our recent exchange on the Vc issue tracker. I took your progress with std::simd development as an encouragement to try and port pv, my image viewer (https://bitbucket.org/kfj/pv) to use std::simd instead of Vc. This has succeeded, using a wrapper class to produce a data type which internally is a std::simd (by private inheritance) and yet exposes an interface which is similar to Vc's SimdArray, to make it usable throughout pv's and vspline's code base. So now I can offer a large-ish, complex application, using your std::simd implementation extensively, to play with. I set up a new branch 'stdsimd' in the pv repo, the implementation of my wrapper class is here: https://bitbucket.org/kfj/pv/src/stdsimd/vspline/simd_type.h This class may be of interest to other Vc users wanting to port their code to std::simd, because it's interface covers a good part of Vc::SimdArray functionality. Performance of a pv binary using this code, clang++ 10.0.0 and -Ofast, is about 30% worse than the Vc version. I tried 'slotting in' Vc's gather and scatter code by reinterpret_casting the std::simd types to their equivalent Vc types (this does indeed work), but this slowed things down. My code to employ where_expressions is probably suboptimal. I suspect that use of an equivalent of Vc's interleaved_memory_mapper may improve performance (by less than 10%). Another issue might be the vectorization of standard functions, which may still be better in Vc (especially the transcendental ones, which are used a lot in pv). So this is encouraging, but it's not quite there yet. |
Going through the 'Working Draft, C++Extensions forParallelism Version 2', I could not find any reference to gather/scatter functions. How is this supposed to be done? Additionally, I think that strided gather/scatter, where the data are equidistant in memory, would also be desirable.
The text was updated successfully, but these errors were encountered: