Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

serdect: safer bytes handling #1112

Merged
merged 9 commits into from
Oct 31, 2023
Merged

Conversation

fjarri
Copy link
Contributor

@fjarri fjarri commented Jun 21, 2023

See #1111 for the discussion leading to this PR.

Note: this is a breaking change in the ABI.

Previous behavior: byte slices were serialized with the default serde implementation for [T] with T = u8. For some formats like MessagePack or CBOR it leads to array serialization, and separate u8 values get treated as packed integers - that is, every value over 127 is prefixed (0xCC in MessagePack, 0x18 in CBOR). This leads to non-constant-time behavior, and also is counter-intuitive, because it results in variable serialized length for the byte slices of the same length.

Changes:

  • Use serialized_bytes in the slice module, funneling to the specific "bytestring" pathways in corresponding formats, which don't do packing.
  • slice::deserialize_hex_bin_vec() changed accordingly. Note that slice::deserialize_hex_or_bin() deserialized from a bytestring already, so it was inconsistent with the serializer (which could be revealed in CBOR tests, if it was used there).
  • Added a table with supported formats to Readme.md.

In tests, I replaced the 0x0F value with 0xFF to check if the packing occurs. CBOR and MessagePack do packing, so their tests include the packing markers in the tests for array serialization, and the lack of constant-timeness is reflected in the table in Readme.

@tarcieri
Copy link
Member

In tests, I replaced the 0x0F value with 0xFF to check if the packing occurs. Currently the CBOR array tests fail because of it.

Perhaps you can leave it alone for now and we can examine arrays separately in a followup.

I'd also be curious to know which formats we currently test against this impacts and what the impact is, i.e. is this a mostly backwards compatible change or are there formats other than MessagePack where this is a breaking change.

Note that if this is a breaking change it will take quite awhile (i.e. months) to roll out as serdect is a very deep dependency.

@fjarri
Copy link
Contributor Author

fjarri commented Jun 21, 2023

It is a breaking change for CBOR because the tags for an array and a bytestring are different (see the change in the test). Bincode seems to be unaffected.

Copy link
Contributor

@daxpedda daxpedda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also add an assert!() in the proptest!s checking that the length remains the same?
EDIT: I think we should also add 0xFF to the test bytes of the other formats as well.

@tarcieri this is definitely a breaking change because CBOR's output is changed.
Also apparently CBOR exhibits the same issue as MessagePack (which is also why it did change it's output). I think we should definitely fix that for arrays as well.

serdect/README.md Outdated Show resolved Hide resolved
serdect/README.md Outdated Show resolved Hide resolved
serdect/src/slice.rs Outdated Show resolved Hide resolved
serdect/tests/cbor.rs Outdated Show resolved Hide resolved
serdect/tests/cbor.rs Outdated Show resolved Hide resolved
serdect/tests/messagepack.rs Outdated Show resolved Hide resolved
@tarcieri
Copy link
Member

Since it's a breaking change, please bump the serdect version to v0.3.0-pre.

Also please be aware since it's a breaking change that we can't update elliptic-curve until elliptic-curve v0.14. That will also involve cutting a breaking crypto-bigint v0.6 release. All of that is several months away at least.

Alternatively consider adding new APIs and deprecating the old ones.

@tarcieri
Copy link
Member

tarcieri commented Jun 21, 2023

Re: array changes, ugh, that just really really sucks. It is just very very unfortunate have to break everyone's serializations to add a completely redundant length prefix to work around deficient format designs/implementations.

That's going to be a very ugly breaking change with wide-ranging impact, which will needlessly make a whole slew of messages longer, and all serializations that don't include the redundant length prefix incompatible with the newer, needlessly longer format.

Is there really no better solution? It seems like an absolute last resort. Perhaps it's worth inquiring upstream for potential alternative solutions?

Any of you volunteering to support and explain the changes to people who will inevitably ask why we're breaking all existing serializations of arrays?

@tarcieri
Copy link
Member

Also what's the solution for people who have existing message serializations? Are they just screwed and have to serialize everything?

@daxpedda
Copy link
Contributor

It is just very very unfortunate have to break everyone's serializations to add a completely redundant length prefix to work around deficient format designs/implementations.

Just to clarify, imho it's not a deficiency in the implementations but in Serde itself. If we communicate we want a tuple, they will use their tuple encoding. Serde just doesn't offer a way to communicate homogeneous fixed-size sequences.

Perhaps it's worth inquiring upstream for potential alternative solutions?

As pointed out above as far as I understand the implementations themselves can't do much about it. I was initially thinking they could detect that the tuple is homogeneous and use their array encoding ... but that's not right either, because some users require specific formats e.g. implementing a specific protocol, that then would break.

It could otherwise be done by making it a non-standard option, e.g. CBOR/MessagePack could have an option to turn off varint encoding in specific cases.

Otherwise it's up to Serde, which isn't gonna happen anytime soon I assume.

Any of you volunteering to support and explain the changes to people who will inevitably ask why we're breaking all existing serializations of arrays?

I'm happy to respond to issues and help with support (as long as it's Rust related).

Also what's the solution for people who have existing message serializations? Are they just screwed and have to serialize everything?

I think so? But I guess that's the point of breaking changes?


  • We could just not do this and leave a note in the Readme stating compatibility issues with certain protocols, e.g. CBOR and MessagePack.
  • Alternatively we could just offer two sets of APIs, but this doesn't really make a whole lot of sense because e.g. if you are using crypto-bigint, you can't really select which one to use. Unless we use crate features or some other not so viable workaround.

I honestly don't see a good way around this.
The only viable one is just to leave it incompatible with certain formats.

@tarcieri
Copy link
Member

Since the array serializer under this scheme would lose the advantage of not having a length prefix, perhaps it should be deprecated (or made into a thin wrapper around a slice deserializer).

@tarcieri
Copy link
Member

I think so? But I guess that's the point of breaking changes?

Breaking changes can still have a migration path, and we're not really offering one here

@daxpedda
Copy link
Contributor

Since the array serializer under this scheme would lose the advantage of not having a length prefix, perhaps it should be deprecated (or made into a thin wrapper around a slice deserializer).

I think the argument that @fjarri made before is that the array serializer does bound checking, which the slice one does not.

I think so? But I guess that's the point of breaking changes?

Breaking changes can still have a migration path, and we're not really offering one here

I honestly can't think of one. It would require some sort of versioning support, which we don't have in place either.
Really the only migration users can do is just to deserialize all their existing stuff with the old version and serialize it back into the new version.

It's a bleak outlook I agree :/

@daxpedda
Copy link
Contributor

Looking at Serde it actually did have support for this in the past but was removed in favor of tuples.
Somebody hit the exact same problem as well: serde-rs/serde#2120 (comment).

Related:

In addition Serde also uses serialize_tuple for fixed sized arrays themselves: https://docs.rs/serde/1.0.164/src/serde/ser/impls.rs.html#140-168.

@fjarri
Copy link
Contributor Author

fjarri commented Jun 21, 2023

Just to clarify, imho it's not a deficiency in the implementations but in Serde itself.

As far as I understand it's a deficiency of specific formats. For MessagePack at the very least, probably for CBOR too. serde can't do much about it.

I think the argument that @fjarri made before is that the array serializer does bound checking, which the slice one does not.

Yes, I would like something that I can use with arrays with the least amount of boilerplate. Ideally, something I can just put into #[serde(with)] annotation, for an array, a vector, or a Box<[u8]> (this currently works for serializers but not for deserializers).

It's a bleak outlook I agree :/

There won't be that many users affected after all, the only thing that breaks is CBOR + slice (in this PR specifically, assuming we don't change arrays to slice behavior). The other tested formats are unchanged.

By the way, I did not want to cause all this frustration. I just wanted something I could use in place of https://github.com/nucypher/rust-umbral/blob/master/umbral-pre/src/serde_bytes.rs for my crates without copying it everywhere. I thought serdect could be that after some adjustments, but if that does not work, I have no problems just making my own crate.

@tarcieri
Copy link
Member

I guess we should try to fix it, including arrays, though it will take quite some time to get it rolled out

@daxpedda
Copy link
Contributor

Just to clarify, imho it's not a deficiency in the implementations but in Serde itself.

As far as I understand it's a deficiency of specific formats. For MessagePack at the very least, probably for CBOR too. serde can't do much about it.

I'm pretty sure this is incorrect, MessagePack only uses varint encoding because Serde communicates this array as a tuple, which MessagePack handles differently. The same can be said about CBOR, which functions as desired when we are using the proper encoding that CBOR does provide.

By the way, I did not want to cause all this frustration. I just wanted something I could use in place of https://github.com/nucypher/rust-umbral/blob/master/umbral-pre/src/serde_bytes.rs for my crates without copying it everywhere. I thought serdect could be that after some adjustments, but if that does not work, I have no problems just making my own crate.

I consider this whole issue a bug. We misused Serde's API, is how I see it. So thanks for figuring this all out, reporting it and helping us understand it!

@fjarri
Copy link
Contributor Author

fjarri commented Jun 21, 2023

I'm pretty sure this is incorrect, MessagePack only uses varint encoding because Serde communicates this array as a tuple, which MessagePack handles differently.

How would you represent a fixed-size u8 array without a length specifier in MessagePack format?

Because that's what I meant when I said that it's a deficiency of formats themselves. If you consider a length specifier to be acceptable for fixed-size arrays, then yes, it is a problem on the serdect side and can be fixed. And I guess serde is at fault for not providing ways to say "this byte array is fixed-size" and convey that to format implementations, so that crates like bincode that support it could serialize them without the length specifier.

@daxpedda
Copy link
Contributor

How would you represent a fixed-size u8 array without a length specifier in MessagePack format?

I assume with "length specifier" you mean varint encoding? That would be the "bin 8" format. Which is the format rmp-serde uses for encode_bytes(). This encoding does not use any varint encoding.

The problem is that encode_tuple() can't tell rmp-serde that this is a homogeneous array of bytes, that is why it chooses an inappropriate MessagePack format.
In the case of rmp-serde, it uses either the "fixarray", "array 16" or "array 32" format and then encodes every element with it's own format, which in our case is u8. For encoding an individual u8 rmp-serde chooses either "positive fixint", "uint 8", "uint 16", "uint 32" or "uint 64", which is what creates the varint encoding.

So what I'm trying to get here is, that encode_tuple doesn't give the correct information to rmp-serde to make the best decision here. Even though MessagePack doesn't have a dedicated format for fixed size array, it has the "bin" format, which is the right format to use in this case.
If Serde had a serialize_fixed_bytes() function, MessagePack could do the right thing, which in this case is doing the same thing that serialize_bytes() does, because the MessagePack format can't optimize this type any further.


If you meant with length specifier just the length of the byte string ... then AFAIK MessagePack can't do that.
But it should still pick the best format for the task, which is still "bin" and not "array".

@fjarri
Copy link
Contributor Author

fjarri commented Jun 21, 2023

I assume with "length specifier" you mean varint encoding? That would be the "bin 8" format.

No, the information about the length of the following array. Like, for a Messagepack "DC0010000102030405060708090A0B0C0D0ECCFF", 0x10 is the length specifier. Bincode does not add that to fixed-size arrays.

If Serde had a serialize_fixed_bytes() function, MessagePack could do the right thing

Well, it could do an array with a length specifier, and bincode could do the actual right thing because the format allows it. But yes, these are my thoughts too. It would be great if serde provided that possibility.

Comment on lines 32 to 43
The table below lists the crates `serdect` is tested against.
&#x274c; marks the cases for which the serialization is not constant-size for a given data size (due to the format limitations), and therefore will not be constant-time too.

| Crate | `array` | `slice` |
|--------------------------------------------------------------------|:--------:|:--------:|
| [`bincode`](https://crates.io/crates/bincode) v1 | &#x2705; | &#x2705; |
| [`ciborium`](https://crates.io/crates/ciborium) v0.2 | &#x274c; | &#x2705; |
| [`rmp-serde`](https://crates.io/crates/rmp-serde) v0.2 | &#x274c; | &#x2705; |
| [`serde-json-core`](https://crates.io/crates/serde-json-core) v0.5 | &#x2705; | &#x2705; |
| [`serde-json`](https://crates.io/crates/serde-json) v1 | &#x2705; | &#x2705; |
| [`toml`](https://crates.io/crates/toml) v0.7 | &#x2705; | &#x2705; |

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are good to go to fix arrays in this PR as well, please correct me if I'm wrong @tarcieri, so this table could probably just list crates that we test against.

@daxpedda
Copy link
Contributor

If Serde had a serialize_fixed_bytes() function, MessagePack could do the right thing

Well, it could do an array with a length specifier, [..]

rmp-serde does make an array with a length specifier (in MessagePack all arrays use length specifiers), which is the problem, because then every element needs to be encoded separately, which introduces the varint encoding.

[..], and bincode could do the actual right thing because the format allows it.

Well, Bincode doesn't actually use varint encoding by default, that's why it doesn't hit that problem. Additionally, Bincode's varint encoding doesn't affect u8s.

So if you used Bincode with varint encoding, a [u16; N] and used serialize_tuple(), you would have the exact same problem in Bincode. In this case Serde fails even further, because you can't communicate non-u8 slices at all, serialize_bytes() only takes &[u8].

Generally speaking you want to use varint encoding if you are dealing with individual integers, but not integer strings. Which is exactly what Serde doesn't expose, but both MessagePack and CBOR support (and Bincode would too).

serdect/Cargo.toml Outdated Show resolved Hide resolved
@daxpedda
Copy link
Contributor

@fjarri are you still interested in pushing this PR over the finishing line?
According to tarcieri, #1112 (comment), we have the green light to fix arrays as well now.

@fjarri
Copy link
Contributor Author

fjarri commented Jul 21, 2023 via email

@tarcieri
Copy link
Member

I'd prefer to save this for the next breaking release cycle, so maybe you can finish it up before we merge (no rush)

Since serde does not support uniform serialization of fixed-sized arrays.
@fjarri
Copy link
Contributor Author

fjarri commented Aug 2, 2023

Made the changes for the arrays. Not sure if LengthCheck makes sense, or I should have just copy-pasted the implementation for simplicity. Also note that deserialize for slices does not return the reference to the buffer anymore (since it's mutated in place anyway).

Copy link
Contributor

@daxpedda daxpedda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay!

Just a couple of documentation and test nits.
Would also like to see an assert! in the proptests to make sure the length actually stays the same. Basically testing what this PR is fixing.

serdect/src/array.rs Outdated Show resolved Hide resolved
serdect/tests/cbor.rs Outdated Show resolved Hide resolved
serdect/tests/messagepack.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@daxpedda daxpedda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I forgot to mention one last change I would like to see: move the Visitor implementation into it's own file. Now that it's equally shared between slice and array implementation, it doesn't make sense that it's living in the slice module.

Would also be interesting if we could move ExactLength and UpperBound into their respective modules.

@fjarri fjarri force-pushed the serdect-bytes branch 2 times, most recently from a155af0 to d8e666a Compare August 6, 2023 20:21
Copy link
Contributor

@daxpedda daxpedda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@tarcieri tarcieri merged commit 57b7d02 into RustCrypto:master Oct 31, 2023
172 checks passed
tarcieri added a commit that referenced this pull request Oct 31, 2023
PR #1112 contained breaking changes. To reflect that, this bumps the
`serdect` version to a prerelease.

Note that there is not a crate release associated with this bump. If
there is a prerelease, it will be `v0.3.0-pre.0`.
tarcieri added a commit that referenced this pull request Oct 31, 2023
PR #1112 contained breaking changes. To reflect that, this bumps the
`serdect` version to a prerelease.

Note that there is not a crate release associated with this bump. If
there is a prerelease, it will be `v0.3.0-pre.0`.
@fjarri fjarri deleted the serdect-bytes branch November 1, 2023 06:34
Comment on lines -45 to +64
pub fn deserialize_hex_or_bin<'de, D>(buffer: &mut [u8], deserializer: D) -> Result<&[u8], D::Error>
pub fn deserialize_hex_or_bin<'de, D>(buffer: &mut [u8], deserializer: D) -> Result<(), D::Error>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fjarri running into problems with this change when upgrading this crate.

This API in particular is for deserializing variable-length data, however removing the &[u8] return value means the amount of data that was actually deserialized was lost.

buffer is intended to hold a backing buffer which is as large as the deserialized data can possibly be, but it may be smaller, and the return value is there to return the actual amount of data deserialized.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a tracking issue: #1322

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants