Use s3 for temporary storage #177
Conversation
Looks good!
Tagged @liam-lloyd since he has been looking at multipart upload too.
Force-pushed from 249938b to a6ae32c
We were using the local file system to store files temporarily as they were being uploaded to the SFTP service. We knew this was not a long-term solution for several reasons, including the security risk of a denial of service from filling up the disk. We're moving to the long-term solution of using S3 directly for that upload. We can't use the existing upload-service and permanent API for this because, at least for now, it requires us to know the size of the file AND does not allow us to upload in multiple parts. For this reason we're integrating with S3 directly.

Issue #145: A user could create a DoS by uploading files that are too large
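To make the multipart flow described above concrete, here is a minimal sketch. The `S3MultipartClient` interface and `uploadInParts` function are hypothetical names invented for this illustration; in production the interface would be backed by the real `@aws-sdk/client-s3` commands (`CreateMultipartUpload`, `UploadPart`, `CompleteMultipartUpload`), which follow the same create / upload-parts / complete shape.

```typescript
// Hypothetical minimal interface over S3 multipart upload; a production
// implementation would delegate to @aws-sdk/client-s3.
interface S3MultipartClient {
  // Returns the UploadId that ties the parts together.
  createMultipartUpload(bucket: string, key: string): Promise<string>;
  // Uploads one part (1-indexed) and returns its ETag.
  uploadPart(
    bucket: string,
    key: string,
    uploadId: string,
    partNumber: number,
    body: Buffer,
  ): Promise<string>;
  // Finalizes the upload from the recorded part numbers and ETags.
  completeMultipartUpload(
    bucket: string,
    key: string,
    uploadId: string,
    parts: { PartNumber: number; ETag: string }[],
  ): Promise<void>;
}

// Uploads a file as a series of parts whose total size need not be known
// up front; returns the number of parts uploaded.
async function uploadInParts(
  client: S3MultipartClient,
  bucket: string,
  key: string,
  chunks: Buffer[],
): Promise<number> {
  const uploadId = await client.createMultipartUpload(bucket, key);
  const parts: { PartNumber: number; ETag: string }[] = [];
  for (let i = 0; i < chunks.length; i++) {
    const etag = await client.uploadPart(bucket, key, uploadId, i + 1, chunks[i]);
    parts.push({ PartNumber: i + 1, ETag: etag });
  }
  await client.completeMultipartUpload(bucket, key, uploadId, parts);
  return parts.length;
}
```

Because the client is injected, the flow can be exercised against a stub without AWS credentials, which also makes the upload path unit-testable.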
Force-pushed from a6ae32c to 2f0b78e
if (process.env.TEMPORARY_FILE_S3_BUCKET === undefined) {
  throw new SystemConfigurationError('TEMPORARY_FILE_S3_BUCKET must be populated in order to upload to s3.');
}
if (process.env.AWS_ACCESS_KEY_ID === undefined) {
  throw new SystemConfigurationError('AWS_ACCESS_KEY_ID must be populated in order to upload to s3.');
}
if (process.env.AWS_ACCESS_SECRET === undefined) {
  throw new SystemConfigurationError('AWS_ACCESS_SECRET must be populated in order to upload to s3.');
}
if (process.env.TEMPORARY_FILE_S3_BUCKET_REGION === undefined) {
  throw new SystemConfigurationError('TEMPORARY_FILE_S3_BUCKET_REGION must be populated in order to upload to s3.');
}
This is exactly the same check as in the TemporaryFile class. Can you wrap this check in a function and call it in these two places?
Not a hill I would die on though.
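The suggested refactor might look something like the sketch below. `requireEnv` is a hypothetical helper name, and `SystemConfigurationError` is redefined here only so the sketch is self-contained; the real class lives in the project.

```typescript
// Stand-in for the project's error class, included so this sketch runs on its own.
class SystemConfigurationError extends Error {}

// Returns the value of the named environment variable, throwing the same
// SystemConfigurationError the duplicated checks throw when it is unset.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (value === undefined) {
    throw new SystemConfigurationError(
      `${name} must be populated in order to upload to s3.`,
    );
  }
  return value;
}

// Usage replacing the four duplicated checks:
// const bucket = requireEnv('TEMPORARY_FILE_S3_BUCKET');
// const region = requireEnv('TEMPORARY_FILE_S3_BUCKET_REGION');
```

Returning the value (rather than just validating) also removes the need to re-read `process.env` at the call sites.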
I was about to write this in the issue but realized the PR is probably the best place for it at this point. In testing this out locally I realized that the "local buffer" implementation doesn't quite work with the way SFTP sends data (and, in fact, wouldn't have worked even without the local buffer, and maybe wouldn't even have worked in the original implementation!). The SFTP protocol includes an offset in each write request.
Which is to say, writes are done in parallel AND are not guaranteed to be sequential. I'm currently working to figure out a good way to handle this.
It looks like the SFTP protocol (v3) does not provide a way to signal to clients that writes must be sequential. I think this means we need to support non-sequential writes and honor the offset in each write request. The trouble here is that we now have some conflicting constraints: S3 multipart parts (other than the last) must be at least 5MB, yet SFTP writes are not guaranteed to arrive in order.
If data were being written sequentially we would have no issue: just append to a single local temporary file and, when it gets large enough, upload it. Since data is NOT guaranteed to be written sequentially, we have no way of ensuring that we will get a 5MB chunk. For example, one could imagine an SFTP client that breaks a 100MB file into twenty separate 5MB segments and sends parts of each segment in a round-robin style. This would functionally result in storing all 100MB in local temporary storage before writing anything to S3! It does appear that rclone writes sequentially, by the way; it's just that we need to appreciate that this is not guaranteed.

I'm going to reflect on alternate approaches given these revelations. One approach could be to use S3 even for the "temporary" storage, as a series of individual small files that are indexed by their offsets and ultimately get coalesced. I am concerned that this would lead to massive overhead from creating so many tiny S3 objects (and I am also unsure whether S3 prices per object, which I could see making this tactic very expensive). Another approach could be to simply implement our current plan, knowing that clients generally write sequentially. We could also add limits to the allowed local storage (though those limits might be higher than 5MB). This could represent a "sweet spot" of preventing DoS risk, but it could also lead to confusing upload failures for users if their SFTP client has an obnoxious write fragmentation algorithm.
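One way to reason about the sequential-flush idea is an offset-indexed buffer that only releases a part once an unbroken run of bytes is available. This is a hypothetical sketch, not code from the PR: `OffsetBuffer` and `takePart` are invented names, and the 5MB constant reflects S3's minimum part size for all parts except the last.

```typescript
// S3 requires every multipart part except the last to be at least 5MB.
const PART_SIZE = 5 * 1024 * 1024;

// Holds out-of-order SFTP writes in memory and releases a part only once a
// contiguous run starting at the next unflushed byte reaches the part size.
// Note the DoS concern from the discussion above still applies: a pathological
// round-robin writer keeps `segments` growing without ever yielding a part.
class OffsetBuffer {
  private segments = new Map<number, Buffer>(); // write offset -> data
  private flushedUpTo = 0; // first byte not yet handed off to S3

  write(offset: number, data: Buffer): void {
    this.segments.set(offset, data);
  }

  // Returns a part to upload if at least `partSize` contiguous bytes are
  // available from `flushedUpTo`; otherwise null (a gap remains). Whole
  // segments are taken, so a returned part may slightly exceed `partSize`.
  takePart(partSize: number = PART_SIZE): Buffer | null {
    const run: Buffer[] = [];
    let cursor = this.flushedUpTo;
    let size = 0;
    while (size < partSize) {
      const seg = this.segments.get(cursor);
      if (seg === undefined) return null; // gap: not contiguous yet
      run.push(seg);
      cursor += seg.length;
      size += seg.length;
    }
    // Commit: drop the consumed segments and advance the watermark.
    let drop = this.flushedUpTo;
    for (const seg of run) {
      this.segments.delete(drop);
      drop += seg.length;
    }
    this.flushedUpTo = cursor;
    return Buffer.concat(run);
  }
}
```

The caller would invoke `takePart()` after each SFTP write and upload the returned buffer as the next multipart part; local memory use is then bounded by how far ahead of the contiguous frontier the client writes.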
Marking this as draft for now since there have been enough odd curveballs that it warrants stepping back and making sure this is still the approach we want to take. I also just spoke with @kfogel, and I'm going to at least update the current (full tmp storage) implementation to properly account for offset.
This PR replaces the use of local storage for temporary upload progress with S3.
By making this change we prevent the risk of local file storage being filled.
Eventually the SDK and permanent backend will support multipart upload, at which point we will be able to remove this direct integration with S3.
Resolves #145