SIMD in the lexer #2285
Replies: 5 comments 11 replies
Options discussed:
1. Enable SIMD only on Rust Nightly with a feature flag, and use portable-simd. (From Joe, React team lead: https://twitter.com/en_JS/status/1676484656575946753)
2. Implement from scratch inside OXC.
3. Implement the SIMD primitives from scratch in an external crate (which would export generic Vector types), and then build OXC-specific functions around those types within OXC.
4. Attempt to collaborate with maintainers of memchr to add what we need to memchr.
-
Some parts of the lexer could likely be accelerated significantly by using SIMD instructions, particularly those which chew through a series of bytes searching for an end character.
In all of these cases, the algorithms themselves in a SIMD implementation would not be very complex: use either cmpeq or pshufb + movemask + trailing_zeros() to locate the first byte in a batch matching a certain pattern. It's only a few lines, and there are some good examples to follow, like memchr and simd-json.
However, there are practical difficulties. I'm opening this issue to discuss options and their feasibility.
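To make the shape of the algorithm concrete, here's a minimal sketch of the cmpeq + movemask + trailing_zeros pattern using stable SSE2 intrinsics (function name is mine, not OXC's; a real lexer would loop over chunks and handle the tail):

```rust
// Sketch (not OXC's actual code): find the index of the first occurrence of
// `needle` in a 16-byte chunk using SSE2 cmpeq + movemask + trailing_zeros.
#[cfg(target_arch = "x86_64")]
fn find_byte_sse2(chunk: &[u8; 16], needle: u8) -> Option<usize> {
    use std::arch::x86_64::*;
    // Safety: SSE2 is part of the x86_64 baseline, so these intrinsics
    // are always available on this target.
    unsafe {
        // Load 16 bytes and broadcast the needle into every lane.
        let data = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        let pattern = _mm_set1_epi8(needle as i8);
        // Lane-wise compare: matching lanes become 0xFF, others 0x00.
        let eq = _mm_cmpeq_epi8(data, pattern);
        // Collapse the high bit of each lane into a 16-bit mask...
        let mask = _mm_movemask_epi8(eq) as u32;
        if mask == 0 {
            None
        } else {
            // ...and the index of the lowest set bit is the first match.
            Some(mask.trailing_zeros() as usize)
        }
    }
}

// Scalar fallback so the sketch compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn find_byte_sse2(chunk: &[u8; 16], needle: u8) -> Option<usize> {
    chunk.iter().position(|&b| b == needle)
}

fn main() {
    let chunk = *b"let x = foo(bar)";
    assert_eq!(find_byte_sse2(&chunk, b'('), Some(11));
    assert_eq!(find_byte_sse2(&chunk, b'#'), None);
}
```

The whole trick is in the last two steps: movemask turns 16 lane-compares into a single integer, and trailing_zeros turns that integer into an index, both in a couple of cycles.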
Compilation targets
The biggest headache is supporting multiple compilation targets (x86 with SSE2, SSE4, AVX, AVX2, AVX-512 / aarch64 NEON / WASM / fallback for other architectures).
portable-simd is designed to tackle exactly this problem, but it requires Rust Nightly, so is not an option for OXC at present. By the looks of things, portable-simd's API is still undergoing some churn, and judging by Rust's usual anxiety about covering every possibility before making a feature stable, I'm guessing it could take several years to stabilize.
I cannot find any crates which offer what we need prebuilt and cover both x86 and aarch64. I assume both are essential, as servers tend to be x86, while many developers will be running OXC on MacBooks with ARM M-series chips. WASM support would also be nice.
memchr does cover all these targets, but can only match against a maximum of 3 different byte values, which is insufficient for our needs. It's also optimized for finding uncommon needles in large haystacks, whereas in OXC you'd often expect a match in the first batch (almost all JS identifiers are less than 32 characters long, for example).
Some possible approaches we could take are the four options listed at the top of this issue.
If using portable-simd on nightly only
This is by far the easiest route to getting something working.
How would some of OXC's notable consumers (e.g. Rolldown) feel about this approach?
@Boshen Would you be comfortable with the NodeJS NAPI build and WASM build using nightly?
If implementing from scratch
Implementing from scratch would be challenging. memchr's code to switch between different implementations depending on architecture is quite forbidding.
We would also need to decide whether to select which implementation to use at compile time or at run time:
Compile-time selection is simpler to implement, but requires compiling with -C target-cpu=native to use the fastest implementation available on a particular machine, or building multiple binaries for SSE2, AVX, AVX2 etc.
Run-time selection allows a single binary to switch to the implementation which uses the fastest instructions and widest vector width available on the machine but, judging from the code memchr uses to do this, is pretty gnarly to do well.
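For reference, a bare-bones sketch of run-time selection (function names are hypothetical, not OXC's API; memchr's real machinery additionally caches the chosen implementation in an atomic function pointer to avoid re-checking on every call):

```rust
// Sketch of run-time implementation selection. SSE2 is baseline on x86_64,
// so the scalar fallback below is only reached on other architectures or
// when AVX2 is absent.
fn find_quote(haystack: &[u8]) -> Option<usize> {
    #[cfg(target_arch = "x86_64")]
    {
        // `is_x86_feature_detected!` runs CPUID once and caches the result.
        if is_x86_feature_detected!("avx2") {
            // Safety: we just verified AVX2 is available on this CPU.
            return unsafe { find_quote_avx2(haystack) };
        }
    }
    find_quote_scalar(haystack)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn find_quote_avx2(haystack: &[u8]) -> Option<usize> {
    // A real implementation would process 32-byte chunks with AVX2
    // intrinsics; delegate to the scalar version to keep the sketch short.
    find_quote_scalar(haystack)
}

fn find_quote_scalar(haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == b'"')
}

fn main() {
    assert_eq!(find_quote(b"let s = \"hi\";"), Some(8));
    assert_eq!(find_quote(b"no quotes here"), None);
}
```

The gnarly parts memchr deals with, which this sketch skips, are caching the detection result, layering several fallbacks (AVX2 → SSE2 → scalar), and keeping the unsafe boundaries sound.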
Testing infrastructure
At present, OXC's tests run in CI only on a single flavour of x86_64, and on Mac. Benchmarks and the conformance suite only run on x86_64.
Introducing SIMD (regardless of which approach we take) would make the implementations on different platforms quite different. A bug or performance regression could occur on one architecture without manifesting on another.
So we'd need to run tests, conformance, and benchmarks on a variety of different platforms.
Covering Mac is easy - just run the benchmarks and conformance suite on Mac as well as Linux.
But I don't have any idea how to cover different x86 instruction sets (SSE2, SSE4, AVX, AVX2, AVX-512). Presumably GitHub runners don't offer that range of options, and I don't know whether they guarantee all runners support all of these instruction sets, even if most do.
Is it possible to get code to compile for a specific instruction set on CI, and then run tests and benchmarks for e.g. the SSE2 version on an AVX512-equipped runner? And if so, would that be a good enough proxy for the "real deal"?
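If I understand rustc's codegen flags correctly (this is an assumption on my part, not something I've tried in OXC's CI), target features can be masked off at compile time, which would let an AVX-512 runner build and exercise the baseline code path:

```shell
# Hypothetical CI recipe: force the baseline SSE2 code path even on an
# AVX-512-capable runner by disabling the newer feature sets at compile time.
RUSTFLAGS="-C target-feature=-avx512f,-avx2,-avx,-sse4.2,-sse4.1" cargo test

# And the opposite: opt in to everything the runner's CPU supports.
RUSTFLAGS="-C target-cpu=native" cargo bench
```

One caveat: this only constrains compile-time selection. Code using run-time dispatch (is_x86_feature_detected!) would still take whichever path the actual CPU supports, so that would need its own override hook to test the slower paths.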
Other notes
Some SIMD is possible without using any explicit SIMD primitives: the compiler will automatically vectorize standard Rust code which is written in a certain way. I had hoped we might be able to do everything we need by coaxing the compiler into generating efficient SIMD code itself. But, from asking on the Rust lang forum, it seems it's not possible to generate movemask instructions with this method, which as far as I can see is essential to a fast implementation.
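To illustrate what I mean (my understanding, which may be wrong): branch-free reductions like counting auto-vectorize well, but finding the *position* of the first match needs an early exit, and plain Rust has no way to express "collapse the lane compares into a bitmask and take trailing_zeros":

```rust
// Counting matches is branch-free, so the optimizer can vectorize the
// comparisons and the reduction without any explicit SIMD.
fn count_quotes(chunk: &[u8]) -> usize {
    chunk.iter().filter(|&&b| b == b'"').count()
}

// Finding the first match requires an early return, which (absent a
// movemask) tends to compile to a byte-at-a-time loop.
fn first_quote(chunk: &[u8]) -> Option<usize> {
    for (i, &b) in chunk.iter().enumerate() {
        if b == b'"' {
            return Some(i);
        }
    }
    None
}

fn main() {
    let src = b"say \"hi\" again";
    assert_eq!(count_quotes(src), 2);
    assert_eq!(first_quote(src), Some(4));
}
```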
Just to say, I'm not very knowledgeable about SIMD at all - I've only done a bit of research over the past few weeks. Apologies if some of the above is complete nonsense!
Are there really no crates?
I'm quite surprised to find there don't seem to be any crates available which help with this. memchr and the like contain what's required, but their code is custom for exactly their use cases, and the primitives are not exposed, or reusable.
Have I just not found one yet? Anyone aware of something suitable which already exists?