Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partial Checksum Extension #172

Open
wants to merge 6 commits into
base: 1.1
Choose a base branch
from

Conversation

felix-schwarz
Copy link

When using just the Checksum extension, the Server is required to discard the data of PATCH requests that ended prematurely (f.ex. because of a loss of connection), so that Clients need to re-send the entire chunk.

This extension allows Clients that use the Checksum extension to verify and confirm the integrity of chunks that were not received in full by the Server - thereby avoiding that data that was already transferred intact needs to be sent again.

Feedback welcome!

- The base64 encoded parts of the example requests and responses is invalid fake data at that point
- add context to the example requests/responses
…artial-Checksum and Upload-Partial-Checksum-Range to simplify the implementation and specification

- edit and partially rewrite for clarity and consistency with the rest of the specification
- include POST requests into requests that can initiate Partial Checksum usage
Copy link
Contributor

@butonic butonic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! IMO this nicely extends the core resumable upload of TUS with checksum support.

protocol.md Outdated Show resolved Hide resolved

This extension allows Clients that use the Checksum extension to verify and confirm the
integrity of chunks that were not received in full by the Server - thereby avoiding that
intact data needs to be sent again.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be changed to also allow for requests that are received in full but where the checksum does not match? E.g. if the file has a Upload-Length of 100 bytes with the first 50 already written and the client uploads the next 50 but with an invalid checksum, should we include the headers to let the client re-calculate the checksum and not have to resend the 50? I would argue that this makes the server side implementation easier as we do not have to differentiate between if the client disconnected or not.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Partial Checksum extension builds on top of the Checksum extension, which stays intact for completed requests.

In your example the PATCH request for bytes 50-100 would - to satisfy the Checksum extension requirements - already include an Upload-Checksum header with the checksum of those 50 bytes.

Per the Checksum extension, the Server would check if the checksum for those 50 bytes would match what the Client sent in the Upload-Checksum header. If it does not, the Server would respond with a 460 Checksum Mismatch error and discard the 50 bytes. Which I'd argue is the right thing to do when the Checksum for the fully received data doesn't match: the Client should still come up with the same checksum as the one it sent originally - and if it is tolerant to checksum errors, it could probably skip using the Checksum and Partial Checksum extensions altogether.

From the server-side, I'd think differentiating between partial/interrupted and fully received requests should be relatively simple, as the server can assume a partial/interrupted request when the received request body's byte count is less than the number in the request's Content-Length header - and a fully received request if byte count and header value match.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do now how the checksum extension work. ;) My comment was more to highlight a different angle on how it could be implemented. As long as it's clear how this extension behaves when both this and checksum is supported I don't think there is an issue here. The use case proposed (for invalid checksum) is very niche and just an idea.

protocol.md Outdated Show resolved Hide resolved

If a Server receives an incomplete `PATCH` or `POST` request with `Upload-Checksum` header, it
MAY store the incompletely received chunk's data (hereafter referred to as *partial data*) in
a separate location instead of discarding it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea for implementation here (this is nitpicking but anyway) is that the server stores the data in the actual file but keeps some kind of marker which bytes are "partial" or "not validated" or similar. This would reduce reduce the number of writes needed for copying the data from one location to another.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text only says store … in a separate location, which could be the same file, or a separate file.

Thinking about it for a bit, the way to implement this with the least filesystem / copy overhead would likely be to simply keep writing all received data to a file, use file truncation as needed to discard data or partial data with bad checksums, and keep track of offsets and partial segments through a separate (JSON) file.

Writing that metadata to a chunk at the end of the file could be even more efficient, but could leave you with invalid data if there's a power outage or crash before the metadata chunk could be written to disk. An outdated JSON file, OTOH, would just prompt the client to upload the possibly imcompletely written part of the data again.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I said "nitpicking" :) It was meant to just bring other solutions for the record in case people read this and interpret is as "store in a different file on disk" which was my first interpretation.

protocol.md Outdated Show resolved Hide resolved
protocol.md Outdated

If the Server receives a `HEAD` request on the upload thereafter, it MAY compute a checksum
on the partial data and then provide the result as well as the range of the partial data within
the upload through the `Upload-Partial-Checksum` and `Upload-Partial-Checksum-Range`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the partial data within the upload" reads a bit hard to me. Not sure on a better copy though.

Copy link
Author

@felix-schwarz felix-schwarz Mar 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some text massaging, I arrived at this:

If the Server receives a HEAD request on the upload thereafter, it SHOULD compute a checksum
on the partial data and then return it, alongside the partial data's start and end offsets relative to
the beginning of the file, in the Upload-Partial-Checksum and Upload-Partial-Checksum-Range
response headers respectively.

What do you think?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much clearer! Nice.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy you like it!


If the checksums don't match, the Client MUST continue the upload from the `Upload-Offset`
returned by the `HEAD` request, at which point the Server MAY discard any partial data it may
still be holding.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section here is very clear. Nicely done! Would this be correct from a servers perspective just to clarify that I understand this correctly? The pretense is that the client started a PATCH request with a Upload-Checksum header and then disconnected:

  • The HEAD request will contain Upload-Partial-Checksum and the Upload-Partial-Checksum-Range headers.
  • In the next PATCH request, if the Upload-Partial-Checksum matches the bytes for the file specified in Upload-Partial-Checksum-Range then the new Upload-Offset is the upload offset before the disconnected PATCH request + the range of the range header. If the Upload-Partial-Checksum header is not the correct checksum (or is missing) then the Upload-Offset is the offset from before the disconnected PATCH request and the server could safely remove the partial data.
  • Do we wish to just discard the partial data if the Upload-Partial-Checksum is invalid or do we wish to return an error to the client? Scratch this, it is answered below. We will return a 409 Conflict.

Copy link
Author

@felix-schwarz felix-schwarz Mar 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The HEAD request will contain Upload-Partial-Checksum and the Upload-Partial-Checksum-Range headers.

If you mean that these headers would be contained in the response to the HEAD request, then: yes, that's exactly how it is meant to work.

(The HEAD request itself would not contain those headers.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sorry! Obviously meant the HEAD response :)

protocol.md Outdated
If a Server receives `Upload-Partial-Checksum` and `Upload-Partial-Checksum-Range`
headers as part of a `PATCH` request and they match its own values for partial data stored
for the upload, the Server MUST append it to the upload and move its internal upload offset
forward accordingly, before processing the request any further.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this part a bit hard to read. Basically what you are saying is that if the Upload-Partial-Checksum header matches for the range specified in Upload-Partial-Checksum-Range then the server should no longer count this as partial data but instead as part of the file (either by copying it to the actual file or removing a marker or similar)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly that.

I tried to make it a bit easier to read and arrived at this (in the next commit):

If a Server receives Upload-Partial-Checksum and Upload-Partial-Checksum-Range
headers as part of a PATCH request and they match the Server's computed results for its stored
partial data for the upload, the Server MUST append the partial data to the upload and move its
internal upload offset forward accordingly, before processing the request any further.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better, thank you.

the partial data's range, it SHOULD dispose of the partial data.

Clients receiving a `409 Conflict` status in response SHOULD send a new `HEAD` request to the
Server to determine the upload's status before sending the next `PATCH` request.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify: This second HEAD request would also include the Upload-Partial-Checksum-X headers so that the client can try again? Is there a time limit here or is it forever for a specific file?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the server would already have to have disposed of the partial data at that point:

If the Server has disposed of the partial data or if the Client sends range and checksum values
that do not match those of the Server
for the partial data, the Server MUST dispose of the
partial data
(if any) and MUST respond with a 409 Conflict status.

With partial data gone, the Server also shouldn't respond with any Upload-Partial-Checksum-X headers anymore.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Makes sense. Thanks for clarifying!

protocol.md Outdated Show resolved Hide resolved
@smatsson
Copy link

I like this extension! To be honest I think this is what the original checksum extension should have looked like. Hindsight is 20/20 I guess :) Naming is hard and I found the name of the extension and headers a bit confusing. Will have to think about other names though as nothing comes to mind right now.

@Acconut
Copy link
Member

Acconut commented Mar 11, 2021

First of all, thank you for this PR! I have currently a lot on my plate and am a bit overworked, so I don't know yet when I have time to look into this.

@felix-schwarz
Copy link
Author

I like this extension! To be honest I think this is what the original checksum extension should have looked like. Hindsight is 20/20 I guess :) Naming is hard and I found the name of the extension and headers a bit confusing. Will have to think about other names though as nothing comes to mind right now.

Happy you like it! 😀

As with the last extensions I'm completely open to alternative name suggestions. As the extension adds kind of a commit/rollback mechanism, I originally thought of "Checksum Commit", but then found it hard to use the same headers for HEAD response and PATCH request, since only the latter would actually "commit" the partial data.

@felix-schwarz
Copy link
Author

First of all, thank you for this PR! I have currently a lot on my plate and am a bit overworked, so I don't know yet when I have time to look into this.

Thanks for letting me know - and I hope you can find some time to rest and recharge soon.

- massaging of several paragraphs to make them easier to read
@smatsson
Copy link

I like this extension! To be honest I think this is what the original checksum extension should have looked like. Hindsight is 20/20 I guess :) Naming is hard and I found the name of the extension and headers a bit confusing. Will have to think about other names though as nothing comes to mind right now.

Happy you like it! 😀

As with the last extensions I'm completely open to alternative name suggestions. As the extension adds kind of a commit/rollback mechanism, I originally thought of "Checksum Commit", but then found it hard to use the same headers for HEAD response and PATCH request, since only the latter would actually "commit" the partial data.

I've been thinking about the name for a bit and think that first of all we should switch the name around to have "checksum" first. This follows the pattern for other sub extensions (e.g. creation/creation-with-upload and concatenation/concatenation-unfinished). Not sure if "checksum-partial" is a good name though as it's not the checksum in itself that is partial but rather the data. I like the idea of "checksum-commit" but I feel like it might be a bit confusing.

Some suggestions (and why they might be OK):

  • checksum-unfinished - There is data written that have not been finalized yet, since we cannot validate the checksum.
  • checksum-unverified - Same reason as above
  • checksum-partial-data - It's the checksum extension but with support for partial data on disconnects.

What do you think @felix-schwarz ?

@felix-schwarz
Copy link
Author

felix-schwarz commented Mar 31, 2021

I've been thinking about the name for a bit and think that first of all we should switch the name around to have "checksum" first. This follows the pattern for other sub extensions (e.g. creation/creation-with-upload and concatenation/concatenation-unfinished).

Very good point!

Not sure if "checksum-partial" is a good name though as it's not the checksum in itself that is partial but rather the data. I like the idea of "checksum-commit" but I feel like it might be a bit confusing.

My biggest issue with checksum-commit was that "commit" only really works for the client to server request that "commits" the partial data, but is confusing when returned by the server in a HEAD response. And I didn't want to have two different header names whose content would be identical.

Some suggestions (and why they might be OK):

  • checksum-unfinished - There is data written that have not been finalized yet, since we cannot validate the checksum.
  • checksum-unverified - Same reason as above
  • checksum-partial-data - It's the checksum extension but with support for partial data on disconnects.

Thanks! I think checksum-unverified has the same issue as checksum-commit, just reversed: it really only works well for the HEAD response, but not for the request confirming it.

Another one I could think of is checksum-segment, because the checksum targets only a segment of the file.

Depending on choice, that'd result in these header names:

  1. Upload-Checksum-Unfinished + Upload-Checksum-Unfinished-Range: summarizes the use case of this extension quite well, but not sure about the "Checksum Unfinished" name for the extension.
  2. Upload-Checksum-Partial-Data + Upload-Checksum-Partial-Data-Range: nicely reflects the use of "partial data" in the proposed spec. Extension could be called "Checksum for Partial Data" or "Checksum with Partial Data".
  3. Upload-Checksum-Segment + Upload-Checksum-Segment-Range: similar to "partial data", but in a single word. Extension name not so obvious, however. Maybe "Checksum for Segment".

I'm fine with all of them, but if I had to pick one, it'd likely be Partial Data as it works well for both header names and extension name(s).

What do you think @smatsson?

@smatsson
Copy link

smatsson commented Apr 9, 2021

Pardon the late reply! I've been swamped the last couple of weeks.

Naming is truly hard :D I think most of these are OK and just to make it more complicated I will also throw in checksum-resumable in the mix. I'm thinking that the original checksum extension is not resumable as the data will be discarded if the checksum mismatches (I found this quite confusing when adding support to tusdotnet, #143).

I'm wondering if we have to call the extension the same name as the headers? Probably not? I mean we could call the extension "Checksum with partial data" (or "Checksum with segments" or whatever) while the headers are still called Upload-Checksum-Partial-Data-Range etc. Segment has a nice ring to it but might be confusing in regards to chunks (the words being similar, not the concept)? I agree with you that partial data is my favorite so far.

I've been meaning to write a POC on this extension (as with Client-Tag and Challenge) but just haven't gotten around to it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants