Use s3 for temporary storage #177
Conversation
Looks good!
Tagged @liam-lloyd since he has been looking at multipart upload too.
Force-pushed from 249938b to a6ae32c
We were using the local file system to store files temporarily as they were being uploaded to the SFTP service. We knew this was not a long-term solution for several reasons, including the security risk of a denial of service from filling up the disk. We're moving to the long-term solution of using S3 directly for that upload. We can't use the existing upload-service and permanent API for this because, at least for now, it requires us to know the size of the file AND does not allow us to upload in multiple parts. For this reason we're integrating with S3 directly.

Issue #145: A user could create a DoS by uploading files that are too large
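To make the multipart flow described above concrete, here is a minimal sketch. The `S3MultipartClient` interface and `uploadInParts` function are hypothetical names invented for this illustration; in production the interface would be backed by the real `@aws-sdk/client-s3` commands (`CreateMultipartUpload`, `UploadPart`, `CompleteMultipartUpload`), which follow the same create / upload-parts / complete shape.

```typescript
// Hypothetical minimal interface over S3 multipart upload; a production
// implementation would delegate to @aws-sdk/client-s3.
interface S3MultipartClient {
  // Returns the UploadId that ties the parts together.
  createMultipartUpload(bucket: string, key: string): Promise<string>;
  // Uploads one part (1-indexed) and returns its ETag.
  uploadPart(
    bucket: string,
    key: string,
    uploadId: string,
    partNumber: number,
    body: Buffer,
  ): Promise<string>;
  // Finalizes the upload from the recorded part numbers and ETags.
  completeMultipartUpload(
    bucket: string,
    key: string,
    uploadId: string,
    parts: { PartNumber: number; ETag: string }[],
  ): Promise<void>;
}

// Uploads a file as a series of parts whose total size need not be known
// up front; returns the number of parts uploaded.
async function uploadInParts(
  client: S3MultipartClient,
  bucket: string,
  key: string,
  chunks: Buffer[],
): Promise<number> {
  const uploadId = await client.createMultipartUpload(bucket, key);
  const parts: { PartNumber: number; ETag: string }[] = [];
  for (let i = 0; i < chunks.length; i++) {
    const etag = await client.uploadPart(bucket, key, uploadId, i + 1, chunks[i]);
    parts.push({ PartNumber: i + 1, ETag: etag });
  }
  await client.completeMultipartUpload(bucket, key, uploadId, parts);
  return parts.length;
}
```

Because the client is injected, the flow can be exercised against a stub without AWS credentials, which also makes the upload path unit-testable.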
Force-pushed from a6ae32c to 2f0b78e
if (process.env.TEMPORARY_FILE_S3_BUCKET === undefined) {
  throw new SystemConfigurationError('TEMPORARY_FILE_S3_BUCKET must be populated in order to upload to s3.');
}
if (process.env.AWS_ACCESS_KEY_ID === undefined) {
  throw new SystemConfigurationError('AWS_ACCESS_KEY_ID must be populated in order to upload to s3.');
}
if (process.env.AWS_ACCESS_SECRET === undefined) {
  throw new SystemConfigurationError('AWS_ACCESS_SECRET must be populated in order to upload to s3.');
}
if (process.env.TEMPORARY_FILE_S3_BUCKET_REGION === undefined) {
  throw new SystemConfigurationError('TEMPORARY_FILE_S3_BUCKET_REGION must be populated in order to upload to s3.');
}
This is exactly the same check as in the TemporaryFile class. Can you wrap this check in a function and call it in these two places?
Not a hill I would die on though.
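The suggested refactor might look something like the sketch below. `requireEnv` is a hypothetical helper name, and `SystemConfigurationError` is redefined here only so the sketch is self-contained; the real class lives in the project.

```typescript
// Stand-in for the project's error class, included so this sketch runs on its own.
class SystemConfigurationError extends Error {}

// Returns the value of the named environment variable, throwing the same
// SystemConfigurationError the duplicated checks throw when it is unset.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (value === undefined) {
    throw new SystemConfigurationError(
      `${name} must be populated in order to upload to s3.`,
    );
  }
  return value;
}

// Usage replacing the four duplicated checks:
// const bucket = requireEnv('TEMPORARY_FILE_S3_BUCKET');
// const region = requireEnv('TEMPORARY_FILE_S3_BUCKET_REGION');
```

Returning the value (rather than just validating) also removes the need to re-read `process.env` at the call sites.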
I was about to write this in the issue but realized the PR is probably the best place for it at this point. In testing this out locally I realized that the "local buffer" implementation doesn't quite work with the way SFTP sends data (and, in fact, wouldn't have worked even without the local buffer, and maybe wouldn't even have worked in the original implementation!). The SFTP protocol includes an offset in each write request.
Which is to say, writes are done in parallel AND are not guaranteed to be sequential. I'm currently working to figure out a good way to handle this.
It looks like the SFTP protocol (v3) does not provide a way to signal to clients that writes must be sequential. I think this means we need to support non-sequential writes and honor the offset in each write request. The trouble here is that we now have some conflicting constraints: S3 multipart parts (other than the last) must be at least 5MB, yet SFTP writes are not guaranteed to arrive in order.
If data were being written sequentially we would have no issue: just append to a single local temporary file and, when it gets large enough, upload it. Since data is NOT guaranteed to be written sequentially, we have no way of ensuring that we will get a 5MB chunk. For example, one could imagine an SFTP client that breaks a 100MB file into twenty separate 5MB segments and sends parts of each segment in a round-robin style. This would functionally result in storing all 100MB in local temporary storage before writing anything to S3! It does appear that rclone writes sequentially, by the way; it's just that we need to appreciate that this is not guaranteed.

I'm going to reflect on alternate approaches given these revelations. One approach could be to use S3 even for the "temporary" storage, as a series of individual small files that are indexed by their offsets and ultimately get coalesced. I am concerned that this would lead to massive overhead from creating so many tiny S3 objects (and I am also unsure whether S3 prices per object, which I could see making this tactic very expensive). Another approach could be to simply implement our current plan, knowing that clients generally write sequentially. We could also add limits to the allowed local storage (though those limits might be higher than 5MB). This could represent a "sweet spot" of preventing DoS risk, but it could also lead to confusing upload failures for users if their SFTP client has an obnoxious write fragmentation algorithm.
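One way to reason about the sequential-flush idea is an offset-indexed buffer that only releases a part once an unbroken run of bytes is available. This is a hypothetical sketch, not code from the PR: `OffsetBuffer` and `takePart` are invented names, and the 5MB constant reflects S3's minimum part size for all parts except the last.

```typescript
// S3 requires every multipart part except the last to be at least 5MB.
const PART_SIZE = 5 * 1024 * 1024;

// Holds out-of-order SFTP writes in memory and releases a part only once a
// contiguous run starting at the next unflushed byte reaches the part size.
// Note the DoS concern from the discussion above still applies: a pathological
// round-robin writer keeps `segments` growing without ever yielding a part.
class OffsetBuffer {
  private segments = new Map<number, Buffer>(); // write offset -> data
  private flushedUpTo = 0; // first byte not yet handed off to S3

  write(offset: number, data: Buffer): void {
    this.segments.set(offset, data);
  }

  // Returns a part to upload if at least `partSize` contiguous bytes are
  // available from `flushedUpTo`; otherwise null (a gap remains). Whole
  // segments are taken, so a returned part may slightly exceed `partSize`.
  takePart(partSize: number = PART_SIZE): Buffer | null {
    const run: Buffer[] = [];
    let cursor = this.flushedUpTo;
    let size = 0;
    while (size < partSize) {
      const seg = this.segments.get(cursor);
      if (seg === undefined) return null; // gap: not contiguous yet
      run.push(seg);
      cursor += seg.length;
      size += seg.length;
    }
    // Commit: drop the consumed segments and advance the watermark.
    let drop = this.flushedUpTo;
    for (const seg of run) {
      this.segments.delete(drop);
      drop += seg.length;
    }
    this.flushedUpTo = cursor;
    return Buffer.concat(run);
  }
}
```

The caller would invoke `takePart()` after each SFTP write and upload the returned buffer as the next multipart part; local memory use is then bounded by how far ahead of the contiguous frontier the client writes.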
Marking this as draft for now since there have been enough odd curveballs that it warrants stepping back and making sure this is still the approach we want to take. I also just spoke with @kfogel, and I'm going to at least update the current (full tmp storage) implementation to properly account for offset.
This PR replaces the use of local storage for temporary upload progress with S3.
By making this change we prevent the risk of local file storage being filled.
Eventually the SDK and permanent backend will support multipart upload, at which point we will be able to remove this direct integration with S3.
Resolves #145