chacha20poly1305: Implement one-pass encryption and decryption #414
base: master
Conversation
Currently this still makes things slower. I've benchmarked per-commit, as the migration to `universal-hash 0.5` affects performance on its own.

Upgrade to `universal-hash 0.5` and latest `poly1305`:
(benchmark table omitted)

Using one-pass encryption (relative to `universal-hash 0.5`):
(benchmark table omitted)

I think it might be worth experimenting with adding some (unsafe) …
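For reference, a minimal sketch of the kind of per-commit throughput benchmark referred to above. The group and function names, all-zero key/nonce, and 16 KiB buffer size are illustrative assumptions, not the actual benchmark used here; it assumes a recent `chacha20poly1305` release with the `aead::KeyInit`/`Aead` traits:

```rust
use chacha20poly1305::{
    aead::{Aead, KeyInit},
    ChaCha20Poly1305, Key, Nonce,
};
use criterion::{criterion_group, criterion_main, Criterion, Throughput};

fn bench_encrypt(c: &mut Criterion) {
    // Fixed key/nonce are fine for throughput benchmarking (never reuse them in real code).
    let cipher = ChaCha20Poly1305::new(Key::from_slice(&[0u8; 32]));
    let nonce = Nonce::from_slice(&[0u8; 12]);
    let msg = vec![0u8; 16 * 1024];

    let mut group = c.benchmark_group("chacha20poly1305");
    group.throughput(Throughput::Bytes(msg.len() as u64));
    group.bench_function("encrypt-16KiB", |b| {
        b.iter(|| cipher.encrypt(nonce, msg.as_slice()).unwrap())
    });
    group.finish();
}

criterion_group!(benches, bench_encrypt);
criterion_main!(benches);
```

Running this on each commit (e.g. via `git rebase --exec 'cargo bench'`) gives the relative numbers quoted above.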
It's not surprising that you get degraded performance with such core loops. Not only do you use the stream cipher with built-in buffering, which is not needed here, but each call to … has its own per-call overhead.

You should use the block-level ChaCha type, and the code should look roughly like this (it's a very rough draft to demonstrate the idea):

```rust
let mut cipher: ChaChaCore<U10> = ...;
let mut mac: Poly1305 = ...;

// For simplification I will use closure syntax here instead of
// the explicit implementation of the closure traits
cipher.process_with_backend(|cipher_backend| {
    mac.update_with_backend(|mac_backend| {
        // process AAD with mac_backend
        ...

        // least common multiple of the number of bytes which
        // can be processed by each backend in parallel
        let lcm_block_size = lcm(
            cipher_backend::BlockSize::USIZE * cipher_backend::ParBlocksSize::USIZE,
            mac_backend::BlockSize::USIZE * mac_backend::ParBlocksSize::USIZE,
        );
        let mut iter = buffer.chunks_exact_mut(lcm_block_size);
        for lcm_block in &mut iter {
            // we do not have the `cast` helper function right now;
            // it will probably be worth adding it later to one
            // of our traits or utils crates
            let cipher_blocks: &mut [ChaChaParBlocks] = cast_mut(lcm_block);
            let mut par_ks_blocks: ChaChaParBlocks = Default::default();
            for par_blocks in cipher_blocks {
                cipher_backend.gen_par_ks_blocks(&mut par_ks_blocks);
                xor(par_blocks, &par_ks_blocks);
            }
            let mac_blocks: &[Poly1305ParBlocks] = cast(lcm_block);
            for par_blocks in mac_blocks {
                mac_backend.proc_par_blocks(par_blocks);
            }
        }
        let tail = iter.into_remainder();
        // process tail bytes
        ...
    })
});
```

The idea is that the compiler, after inlining the callbacks, will generate a core loop for each combination of supported backends. For hypothetical backends which do not support parallel processing (i.e. with a parallel block size of 1), the generated loop would effectively look like this:

```rust
let mut iter = buffer.chunks_exact_mut(64);
for block in &mut iter {
    cipher.apply_ks_block(block);
    mac.update_block(&block[0..16]);
    mac.update_block(&block[16..32]);
    mac.update_block(&block[32..48]);
    mac.update_block(&block[48..64]);
}
```

This allows the compiler to perform all the necessary optimizations for improved performance, such as keeping data in registers without spilling it onto the stack.

I think we can start with the non-interleaved version, and I will try to add an interleaved version for encryption later. As for decryption, I don't think we should perform decryption before the auth tag has been fully verified, since in some corner cases it may lead to unintended exposure of plaintext. But I guess it will depend on how much performance benefit the interleaving could give us.
Aha, this nested backend approach makes much more sense! I look forward to it being documented in the traits, but in the meantime I'll give this a go.
Ah, I now see why I didn't stumble upon this approach myself. The …
Yeah, the "core" part in the docs of …
No, it does not make sense.
Force-pushed from 217851e to d119025.
Codecov Report
Base: 81.54% // Head: 80.47% // Decreases project coverage by -1.07%.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #414      +/-   ##
==========================================
- Coverage   81.54%   80.47%   -1.07%
==========================================
  Files          32       32
  Lines        1284     1439     +155
==========================================
+ Hits         1047     1158     +111
- Misses        237      281      +44
```

☔ View full report at Codecov.
Force-pushed from d119025 to a23afb2.
I've switched to using the nested closures approach, and it's still significantly slower. The next step would be to look at the assembly changes between #415 and this PR to figure out what's going on. Some of it might be the misalignment due to the AD, but I don't currently buy that as the entire cause.
Status update: I've tried several times over the last 5 months to get this to work, and even with …
I will reiterate my previous suggestion from upthread that we should try to get things properly pipelined on …
I think I also have an attempt at that somewhere locally (that I started before learning of @newpavlov's trait APIs). I'll see if I can dig it up.
Force-pushed from a23afb2 to fcf134b.
Rebased on latest …
The test case was generated by our implementation.
Force-pushed from fcf134b to 92c8901.
Force-pushed from 92c8901 to ac660d5.
Force-pushed to use ….

I left a very large comment in the code documenting the alignment problem. Something I have tried in the past, and am considering trying again, is to pull apart one of the ChaCha20Poly1305 assembly implementations to figure out how they deal with the fundamental alignment issue caused by the associated data.
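For context on where the associated data enters the picture, here is a minimal sketch (not code from this PR; the helper name `poly1305_input` is purely illustrative) of the Poly1305 message layout defined by RFC 8439, i.e. AAD, zero padding, ciphertext, zero padding, then the two lengths:

```rust
// Sketch of the RFC 8439 Poly1305 message layout. The zero padding after the
// AAD means the ciphertext portion starts on a fresh 16-byte Poly1305 block,
// while the ChaCha20 keystream is generated in 64-byte blocks over the
// message buffer alone.
fn poly1305_input(aad: &[u8], ciphertext: &[u8]) -> Vec<u8> {
    fn pad16(buf: &mut Vec<u8>) {
        let rem = buf.len() % 16;
        if rem != 0 {
            buf.extend(core::iter::repeat(0u8).take(16 - rem));
        }
    }

    let mut m = Vec::new();
    m.extend_from_slice(aad);
    pad16(&mut m); // pad AAD to a 16-byte boundary
    m.extend_from_slice(ciphertext);
    pad16(&mut m); // pad ciphertext to a 16-byte boundary
    m.extend_from_slice(&(aad.len() as u64).to_le_bytes());
    m.extend_from_slice(&(ciphertext.len() as u64).to_le_bytes());
    m
}
```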
@str4d now that inline ASM is stable, we can also add an …
Builds on #415.