Support NEON instruction set #12

Open
GCCFeli opened this issue Sep 23, 2016 · 12 comments

@GCCFeli

GCCFeli commented Sep 23, 2016

It would be great if NEON is supported :)

@guillaumeblanc
Owner

Yes, it definitely would. I have no ARM hardware to test the implementation, though.

The process to port the ozz SIMD implementation is:

  • In the simd_math_config.h file:
    • Add NEON detection based on the __ARM_NEON preprocessor definition.
    • Include <arm_neon.h>.
    • Typedef SimdFloat4 and SimdInt4 with NEON types.
    • Include simd_math_neon-inl.h, which will contain the NEON implementation.
  • Add a new file named simd_math_neon-inl.h in the ozz/base/math/internal folder.

The whole library, including SoA implementation, is based on the functions from simd_math_*-inl.h, so there's nothing else needed.
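The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not ozz's actual headers: the `sketch` namespace, the `SKETCH_SIMD_NEON` macro and the `Load`/`Add`/`GetX` helpers are hypothetical names, and a scalar fallback is included so the snippet also builds on non-ARM machines.

```cpp
#include <cassert>

// Detection step: compilers targeting NEON define __ARM_NEON
// (older toolchains may define __ARM_NEON__ instead).
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_SIMD_NEON 1  // hypothetical macro name, not from ozz

namespace sketch {
// Typedef step: map the library's vector types onto NEON register types.
typedef float32x4_t SimdFloat4;
typedef int32x4_t SimdInt4;

// The functions below stand in for what a simd_math_neon-inl.h might hold.
inline SimdFloat4 Load(float x, float y, float z, float w) {
  const float values[4] = {x, y, z, w};
  return vld1q_f32(values);  // load 4 floats into one 128-bit register
}
inline SimdFloat4 Add(SimdFloat4 a, SimdFloat4 b) { return vaddq_f32(a, b); }
inline float GetX(SimdFloat4 v) { return vgetq_lane_f32(v, 0); }
}  // namespace sketch

#else  // No NEON: plain scalar fallback so the sketch builds anywhere.

namespace sketch {
struct SimdFloat4 { float x, y, z, w; };
struct SimdInt4 { int x, y, z, w; };

inline SimdFloat4 Load(float x, float y, float z, float w) {
  return SimdFloat4{x, y, z, w};
}
inline SimdFloat4 Add(SimdFloat4 a, SimdFloat4 b) {
  return SimdFloat4{a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
}
inline float GetX(SimdFloat4 v) { return v.x; }
}  // namespace sketch
#endif

// Usage: callers only see the typedefs and functions, never the backend.
inline void SketchUsage() {
  const sketch::SimdFloat4 sum =
      sketch::Add(sketch::Load(1.f, 2.f, 3.f, 4.f),
                  sketch::Load(2.f, 2.f, 2.f, 2.f));
  assert(sketch::GetX(sum) == 3.f);
}
```

Because callers only go through the typedefs and functions, swapping the backend should not require touching the rest of the code, which matches the point that nothing else in the library needs to change.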

@guillaumeblanc
Owner

I'm reopening the request, as I think implementing it makes a lot of sense.

@jazzbre

jazzbre commented Sep 29, 2016

https://github.com/scoopr/vectorial, or even https://github.com/jratcliff63367/sse2neon, would be a good reference for an SSE/NEON implementation.

@kylawl
Contributor

kylawl commented Aug 1, 2019

We're going to be starting on Switch soon. Expect a PR early next year, but if someone wants to do it before us, that would be nice!

@guillaumeblanc
Owner

Awesome news @kylawl. Don't hesitate to reach me if you want to discuss this or need help/support.

@kylawl
Contributor

kylawl commented May 26, 2021

So it's been a while and I'm back looking at this again. As a first step, I thought I'd just try using sse2neon to see if there's any benefit from simply aliasing all the instructions raw like that. Performance is actually surprisingly poor going this route on Switch. The sse reference implementation takes about 1.2ms for our whole animation phase while using sse2neon takes 2.7ms! Not exactly the sort of thing I was expecting/hoping for.

I've seen some discussion that we could be throttled due to memory access overhead rather than computation, going to need some more investigation.

@ColinGilbert

If I remember correctly, Bullet physics had code contributed by Apple that made it very performant on ARM/iOS. Maybe that would be worth looking at?

@guillaumeblanc
Owner

Welcome back!

You say 1.2ms for the "sse reference implementation". Do you mean the float/scalar reference implementation? If so, it could be worth checking the generated code to see how much of it the compiler auto-vectorizes. All the SoA usages of the math library in ozz are very easy for the compiler to auto-vectorize, so NEON may already be in use. That doesn't mean the 1.2ms cannot be optimized, but optimization expectations would be lower.
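To illustrate what "easy to auto-vectorize" can look like, here is a minimal sketch of SoA-style scalar code. `SoaVec3Sketch` and `SoaLerpSketch` are hypothetical names used only for this example and are not ozz's actual types; whether a given compiler really emits NEON for such a loop can only be confirmed by inspecting the generated assembly (for instance with `-O2 -S` on GCC or Clang).

```cpp
#include <cassert>

// Hypothetical SoA layout: four 3D vectors stored component by component,
// so each array holds the same component of four different vectors.
struct SoaVec3Sketch {
  float x[4];
  float y[4];
  float z[4];
};

// Linear interpolation over the four packed vectors. Every iteration is
// independent and touches contiguous floats, which is the pattern that
// compilers can often turn into 4-wide SIMD instructions on their own.
inline void SoaLerpSketch(const SoaVec3Sketch& a, const SoaVec3Sketch& b,
                          float t, SoaVec3Sketch* out) {
  for (int i = 0; i < 4; ++i) {
    out->x[i] = a.x[i] + (b.x[i] - a.x[i]) * t;
    out->y[i] = a.y[i] + (b.y[i] - a.y[i]) * t;
    out->z[i] = a.z[i] + (b.z[i] - a.z[i]) * t;
  }
}

// Usage: interpolate halfway between two sets of four vectors.
inline void SoaLerpUsage() {
  const SoaVec3Sketch a = {{0, 0, 0, 0}, {0, 0, 0, 0}, {0, 0, 0, 0}};
  const SoaVec3Sketch b = {{2, 4, 6, 8}, {2, 2, 2, 2}, {4, 4, 4, 4}};
  SoaVec3Sketch r;
  SoaLerpSketch(a, b, 0.5f, &r);
  assert(r.x[3] == 4.f && r.y[0] == 1.f && r.z[2] == 2.f);
}
```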

Are the memory access overhead issues you mentioned specific to neon?

@kylawl
Contributor

kylawl commented Jun 6, 2021

You're probably right that the autovectorization is doing a decent job. One thing that sse2neon misses is the common shuffle operation that we use to splat the same value into all 4 components. For that particular shuffle, it uses a multi-instruction "generic" path even though ARM has a specific instruction for that operation. After spending some more time on Switch optimizations, I don't think this is a memory access issue. Needs further investigation for sure.
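For reference, a splat of one lane can be written directly with the NEON duplicate-lane intrinsic instead of going through a generic shuffle. The sketch below is an assumption about how a dedicated path might look, not code from ozz or sse2neon; `Float4Sketch`, `MakeSketch`, `SplatXSketch` and `LaneSketch` are hypothetical names, and a scalar fallback is included so the snippet also builds without NEON.

```cpp
#include <cassert>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

typedef float32x4_t Float4Sketch;

inline Float4Sketch MakeSketch(float x, float y, float z, float w) {
  const float values[4] = {x, y, z, w};
  return vld1q_f32(values);
}

// Broadcast lane 0 to all four lanes. vdupq_lane_f32 reads from a 64-bit
// (2-lane) vector, so the low half is extracted first. Compilers typically
// lower this to a single duplicate instruction.
inline Float4Sketch SplatXSketch(Float4Sketch v) {
  return vdupq_lane_f32(vget_low_f32(v), 0);
}

// Read one lane back through memory, since lane intrinsics require a
// compile-time constant index.
inline float LaneSketch(Float4Sketch v, int i) {
  float values[4];
  vst1q_f32(values, v);
  return values[i];
}

#else  // Scalar fallback with the same interface.

struct Float4Sketch { float f[4]; };

inline Float4Sketch MakeSketch(float x, float y, float z, float w) {
  return Float4Sketch{{x, y, z, w}};
}
inline Float4Sketch SplatXSketch(Float4Sketch v) {
  return Float4Sketch{{v.f[0], v.f[0], v.f[0], v.f[0]}};
}
inline float LaneSketch(Float4Sketch v, int i) { return v.f[i]; }
#endif

// Usage: every lane of the result holds the original x component.
inline void SplatUsage() {
  const Float4Sketch s = SplatXSketch(MakeSketch(7.f, 1.f, 2.f, 3.f));
  for (int i = 0; i < 4; ++i) {
    assert(LaneSketch(s, i) == 7.f);
  }
}
```

Splats of the y, z or w lanes would follow the same pattern with a different lane index (using `vget_high_f32` for the upper two lanes).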

@guillaumeblanc
Owner

Hi,

What did you end up doing on Switch? Did you need or implement NEON optimizations?

Cheers,
Guillaume

@kylawl
Contributor

kylawl commented Mar 4, 2024 via email

@guillaumeblanc
Owner

No worries, thanks for the feedback.
I think it's good to know that the reference implementation provides good results as a cross-platform fallback.
