replace datasvc with a tus.io capable endpoint #284

Closed · 3 tasks

butonic opened this issue Oct 2, 2019 · 4 comments
butonic commented Oct 2, 2019

We currently use the datasvc to upload files. It directly streams the PUT request body sent from the ocdavsvc to the storage drivers. That part is good; the rest is far from scalable:

  • The eos driver currently writes a temp file because the eosclient uses xrdcopy to copy it into eos. It should use Range-defined PUT requests to stream instead.
  • If we implement chunking (no matter if that is old chunking or new chunking) ocdavsvc needs to write a temporary file, because it can only send a single PUT to the datasvc.
    • For eos that leads to writing the file 3 times:
      • first in ocdavsvc when receiving individual chunks,
      • then as a temp file by the eos driver and,
      • finally, when copying it with xrdcopy to eos.
    • For owncloud / local we still need to write 2 copies:
      • first in ocdavsvc when receiving individual chunks,
      • finally, when writing the file with the owncloud / local driver.
    • For s3 we again write 3 copies:
      • first in ocdavsvc when receiving individual chunks,
      • then as a temp file by the s3 driver and,
      • finally, when sending it as a multipart upload to s3.

The bottleneck is the datasvc, which forces every file to be transferred in a single PUT request.

I propose to replace the datasvc with a tusd-based implementation.

  • https://tus.io is an open protocol specification for resumable uploads
  • it supports extensions and currently describes creation, expiration, checksum, termination and concatenation, which allows parallel uploads. We could define a bulk or batch extension to upload multiple small files in one request.
  • tusd is the go reference implementation (MIT licensed)
    • It supports single file uploads with a single request, even though that is not yet part of the released spec: Add Creation With Upload extension tus/tus-resumable-upload-protocol#88, the protocol PR just got merged. The following works and creates a file without having to send subsequent PATCH requests:
      curl -X POST localhost:9997/tus/random-0 -v \
        -u aaliyah_abernathy:secret \
        -H "Tus-Resumable: 1.0.0" \
        -H "Content-Type: application/offset+octet-stream" \
        -H "Upload-Metadata: filename d29ybGRfZG9taW5hdGlvbl9wbGFuLnBkZg==" \
        -H "Upload-Offset: 0" \
        -H "Upload-Length: 40" \
        -d "1234567890123456789012345678901234567890"
      
    • it supports parallel uploads, which can be used to pass through the owncloud chunking from ocdavsvc and s3 multipart uploads (see the concatenation sketch after this list)
  • the handlers can be reused and we can use our own way of routing requests.
    • we could add a handler for plain PUT requests as they are currently handled by the datasvc
  • It can be extended with custom Filestore implementations (eg for eos, owncloud / local or s3)
  • It has a hook system that we can use to trigger workflows when an upload has finished
  • It supports different locking implementations
  • Supports HTTP/2
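
To make the concatenation idea above concrete: a client creates partial uploads (which can run in parallel), fills them with PATCH requests, and then asks the server to stitch them together. A minimal sketch of the tus 1.0 creation + concatenation flow, assuming a creation-capable endpoint at localhost:9997/tus/; the chunk-a / chunk-b upload URLs are made up for illustration:

    # create a partial upload for the first 20-byte chunk
    curl -i -X POST localhost:9997/tus/ \
      -u aaliyah_abernathy:secret \
      -H "Tus-Resumable: 1.0.0" \
      -H "Upload-Concat: partial" \
      -H "Upload-Length: 20"
    # -> 201 Created, Location: /tus/chunk-a (example URL)

    # send the chunk data; a second chunk can be uploaded the same way in parallel
    curl -i -X PATCH localhost:9997/tus/chunk-a \
      -u aaliyah_abernathy:secret \
      -H "Tus-Resumable: 1.0.0" \
      -H "Content-Type: application/offset+octet-stream" \
      -H "Upload-Offset: 0" \
      -d "12345678901234567890"
    # -> 204 No Content, Upload-Offset: 20

    # once all partial uploads are complete, concatenate them into the final file
    # (no Upload-Length here, the server derives it from the partial uploads)
    curl -i -X POST localhost:9997/tus/ \
      -u aaliyah_abernathy:secret \
      -H "Tus-Resumable: 1.0.0" \
      -H "Upload-Concat: final;/tus/chunk-a /tus/chunk-b"
    # -> 201 Created, Location of the assembled upload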

Really a compelling protocol, IMO.

But for reva it would have some consequences:

  • clients would need to use the tus protocol to upload files if they want to directly use CS3
    • this makes tus in effect the upload protocol for CS3
    • several client implementations exist: go, js, java, .net, android, ios, python, php and even bash
    • this is a good thing IMO
  • we can encapsulate workflows as part of the upload process. tusd already supports multiple ways of executing hooks, eg sending http requests or executing shell scripts (see the sketch after this list) ...
    • This would confine the worker queue to the tusdsvc ... locking has also been thought of.
  • the Upload() function of the current storage drivers has to be replaced with storage specific tusd Filestore implementations. As a result, file uploads will no longer go through the storage drivers, unless we make them aware of new files. Which is something we need to do anyway to pick up file changes that bypass reva, eg via ssh.
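
As a sketch of the shell-script hook variant: tusd, when started with -hooks-dir, looks up executables named after the hook (pre-create, post-create, post-receive, post-finish, post-terminate) in that directory and pipes the upload info to their stdin as JSON. The JSON field names and the reva callback URL below are assumptions for illustration, not a verified API:

    #!/bin/sh
    # hooks/post-finish — sketch of a tusd file hook, assuming tusd was
    # started with: tusd -hooks-dir ./hooks
    # tusd pipes the upload info to stdin as JSON; the fields used here
    # (.ID, .Size) are assumptions, check the payload of your tusd version.
    payload=$(cat)
    id=$(echo "$payload" | jq -r '.ID')
    size=$(echo "$payload" | jq -r '.Size')

    # hypothetical callback: tell reva the upload is done so it can assign a
    # CS3 fileid, update the namespace and kick off further workflows
    curl -s -X POST http://localhost:9998/upload-finished \
      -H "Content-Type: application/json" \
      -d "{\"upload_id\":\"$id\",\"size\":$size}"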

Open questions:

  • How do we get the CS3 fileid/reference after a file has finished uploading? We could use a hook to generate one (as sketched above) ... or it can be done with the storage specific Filestore ...
  • How can we get progress information? tus supports HEAD requests on the upload resource (see the sketch after this list) ... we could expose workflow progress? maybe describe a workflow or progress extension?
  • will cross storage moves be affected? yes, they will use the tus protocol as well ... parallel transfer would become possible ...
  • do we need a throttling extension? maybe?
  • can we implement fetching of chunks while the file has not yet finished uploading? Maybe. The tusd service supports GET requests, but I don't know if it allows them for unfinished uploads. It should be possible if we do some bookkeeping of which bytes have been uploaded. We could prevent premature chunk fetching with maybe a get-after-workflow extension that encrypts the stream before sending it to the server: the uploading client sends the decryption key so the server can execute the workflow and release the key to clients that have downloaded the encrypted file. They then only need to decrypt it. ... well ... that only decreases latency anyway ... and only works if the workflow does not change the file ... which is ok I guess ...
  • can we add zsync support to this? Yes, AFAICT.
  • What about encryption? e2e will pass through. server side / at rest should be handled by the storage drivers?
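
Regarding the progress question above: the core protocol already answers "how many bytes arrived" via HEAD on the upload resource; anything beyond that (server-side post-processing state) would indeed need an extension. A sketch, reusing the upload URL from the creation-with-upload example above (the offset value is illustrative):

    # poll upload progress on the upload resource (curl -I sends a HEAD request)
    curl -I localhost:9997/tus/random-0 \
      -u aaliyah_abernathy:secret \
      -H "Tus-Resumable: 1.0.0"
    # -> 200 OK
    #    Upload-Offset: 23    (bytes the server has received so far)
    #    Upload-Length: 40    (total size announced at creation)
    # client-side progress = Upload-Offset / Upload-Length; a workflow or
    # progress extension would have to add similar headers for server-side state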

Overall, I think this clarifies the responsibilities and makes a LOT of sense because it takes away many of the decisions we still need to make.

refs commented Oct 4, 2019

Seems like tackling this at the protocol level is a compelling solution. @jfd do you have a branch published I could watch?

butonic commented Oct 4, 2019

List of things I noticed during the implementation:

labkode commented Oct 8, 2019

Before committing, decide on the protocol here #290

PVince81 commented May 7, 2020

any update? how much of this ticket has been accomplished already?

butonic closed this as completed May 6, 2022