Create a performant, cloud-agnostic way to download & upload files to cloud buckets. #256

Open
alxmrs opened this issue Oct 17, 2022 · 4 comments

Comments

@alxmrs
Collaborator

alxmrs commented Oct 17, 2022

See discussion here: #254 (comment)

To investigate:

  • Can we do better than shutil.copyfileobj?
  • What are the optimal chunk sizes?
  • Can we copy data in parallel?
  • Are there optimizations we can do for large files?
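
As a starting point for the chunk-size question, here is a minimal sketch that times `shutil.copyfileobj` at a few buffer sizes. It assumes in-memory streams as a stand-in for real bucket I/O, so the numbers only show per-chunk overhead, not network behavior. The helper name `time_copy` and the chosen sizes are illustrative, not part of the project.

```python
import io
import shutil
import time


def time_copy(payload: bytes, chunk_size: int) -> float:
    """Copy payload between two in-memory streams and return elapsed seconds."""
    src, dst = io.BytesIO(payload), io.BytesIO()
    start = time.perf_counter()
    # `length` is the buffer size copyfileobj uses for each read/write pair.
    shutil.copyfileobj(src, dst, length=chunk_size)
    elapsed = time.perf_counter() - start
    if dst.getvalue() != payload:
        raise RuntimeError("copy produced different bytes")
    return elapsed


if __name__ == "__main__":
    payload = bytes(8 * 1024 * 1024)  # 8 MiB of zeros, hypothetical test data
    for chunk_size in (16 * 1024, 64 * 1024, 1024 * 1024, 8 * 1024 * 1024):
        print(f"{chunk_size:>9} bytes: {time_copy(payload, chunk_size):.4f} s")
```

To get meaningful results, the in-memory streams would need to be swapped for the real source and destination file objects.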

One idea that @bahmandar has explored is calling gsutil in a subprocess (the CLI is really efficient at file transfer).

@bahmandar
Collaborator

Here is the location for using gcloud for gcs files:
https://github.com/bahmandar/weather-tools/blob/mv-faster/weather_mv/loader_pipeline/sinks.py#L416

The fallback for downloading is shutil, and the fallback for remote files is the Apache Beam file systems.

I also have a different shutil optimized for gcs:
https://github.com/bahmandar/weather-tools/blob/mv-faster/weather_mv/loader_pipeline/sinks.py#L402
One thing to note: it seems more beneficial to change the buffer sizes for the GCS IO and shutil together than to change just one of them.
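
The fallback idea can be sketched roughly like this. It is not the code at the links above: the function name `copy_with_fallback`, the default command, and the use of plain `open()` (which only handles local paths; the remote fallback would go through Beam's file systems instead) are all assumptions for illustration.

```python
import shutil
import subprocess
from typing import Sequence


def copy_with_fallback(
    src: str,
    dst: str,
    cli: Sequence[str] = ("gcloud", "storage", "cp"),
    chunk_size: int = 8 * 1024 * 1024,  # hypothetical buffer size, not a tuned value
) -> str:
    """Try the CLI first; fall back to a buffered Python copy. Returns the strategy used."""
    if shutil.which(cli[0]) is not None:
        try:
            subprocess.run([*cli, src, dst], check=True, capture_output=True)
            return "cli"
        except subprocess.CalledProcessError:
            pass  # CLI is installed but the copy failed; use the Python path below.
    # Local-only stand-in for the fallback file system.
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        shutil.copyfileobj(fsrc, fdst, length=chunk_size)
    return "shutil"
```

Returning the strategy name makes it easy to log which path was actually taken when comparing timings.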

@alxmrs
Collaborator Author

alxmrs commented Nov 28, 2022

Here are some advantages of just using gsutil vs a hand-rolled python solution:

  • over a size threshold, gsutil will automatically parallelize file transfer
  • gsutil uses checksums to verify the integrity of data transferred, and will automatically retry on corrupted data
  • the default dataflow image already has gcloud installed, so in theory it's easy to manage this dependency
  • we get to make use of heavily invested code (maybe some magic constants found from trial & error) from GCP + boto devs
  • we get all these features with a slick one-liner: subprocess.run(f'gsutil cp {src!r} {dst!r}', shell=True, check=True)
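
One caveat on the one-liner: `{src!r}` applies Python's repr quoting, which is not the same as shell quoting, so unusual paths could be misread by the shell. A minimal sketch of an alternative, assuming `gsutil` is on the PATH, passes the arguments as a list and skips the shell entirely. The helper names `build_copy_command` and `gsutil_copy` are hypothetical.

```python
import subprocess
from typing import List


def build_copy_command(src: str, dst: str) -> List[str]:
    """Build the argument list; each path is one argv entry, so no quoting is needed."""
    return ["gsutil", "cp", src, dst]


def gsutil_copy(src: str, dst: str) -> None:
    """Run `gsutil cp` without a shell and surface stderr if the copy fails."""
    try:
        subprocess.run(
            build_copy_command(src, dst),
            check=True,
            capture_output=True,
            text=True,
        )
    except subprocess.CalledProcessError as e:
        raise RuntimeError(f"gsutil cp failed for {src!r} -> {dst!r}: {e.stderr}") from e
```

With `shell=False` (the default), spaces or quotes in a path are passed through unchanged rather than being interpreted.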

alxmrs added a commit that referenced this issue Dec 2, 2022
A partial implementation of #256. My intention here is to see if this can speed up weather-dl requests.
@alxmrs
Collaborator Author

alxmrs commented Dec 2, 2022

Thanks @mahrsee1997 for pointing this out! https://cloud.google.com/blog/products/storage-data-transfer/new-gcloud-storage-enables-super-fast-data-transfers/

With a 10GB file, gcloud storage was 94% faster than gsutil on download and 57% faster on upload.

alxmrs added a commit that referenced this issue Dec 6, 2022
A partial implementation of #256. Here, we copy data using `gsutil cp` instead of a python routine. This speeds things up, since `gsutil` will parallelize uploads of large files. 

* weather-dl now uses `gsutil cp` for file upload.

A partial implementation of #256. My intention here is to see if this can speed up weather-dl requests.

* Temporary: no gsutil version.

* Bump weather-dl version.

* pinning gsutil version.

* Use gcloud alpha storage cp, which is even faster :)

* Set up gcloud sdk, accounting for runtime auth issue.

* Added error handling to the subprocess call for copying.

Co-authored-by: Rahul Mahrsee <86819420+mahrsee1997@users.noreply.github.com>

* fix: added import.

* Changing subprocess invocation to be more secure.

Thanks @shoyer.

Co-authored-by: Stephan Hoyer <shoyer@google.com>

* nit: dst, not dest.

* nit: remove gcloud pip dependency.

* Using gsutil for now until we upgrade project deps.

Co-authored-by: Rahul Mahrsee <86819420+mahrsee1997@users.noreply.github.com>
Co-authored-by: Stephan Hoyer <shoyer@google.com>
@alxmrs
Collaborator Author

alxmrs commented Dec 6, 2022

@mahrsee1997 benchmarked several cloud utilities to see which would be the fastest. The results show that gsutil is the best fit for us. However, it's possible that gcloud alpha storage would be faster if we upgraded to the latest version of gcloud. Using that version of the SDK requires updating the versions of all our GCP dependencies. This is something that we'll tackle, but in a future PR. #265 is still a great win.

I did the benchmarking on a file of size ~18.42 GiB, and it appears that "gsutil" is the most efficient approach here. It's a ~77% reduction in time compared to our original approach of shutil.

=========================================
gcloud alpha storage – 1st run: 6.48 minutes & 2nd run: 6.53 minutes
2022-12-04 00:46:35.871 GMT
[licence4.0] Uploading to store for 'gs://$BUCKET/tmp2/rahul/aplha-storage-20-00:00:00z-tprate.gb'.
2022-12-04 00:53:04.977 GMT
[licence4.0] Upload to store complete for 'gs://$BUCKET/tmp2/rahul/aplha-storage-20-00:00:00z-tprate.gb'.

-----------------------------------------
gsutil – 1st run : 3.82 minutes & 2nd run : 4.72 minutes
2022-12-04 01:24:48.613 GMT
[licence4.0] Uploading to store for 'gs://$BUCKET/tmp2/rahul/gsutil-20-00:00:00z-tprate.gb'.
2022-12-04 01:28:37.283 GMT
[licence4.0] Upload to store complete for 'gs://$BUCKET/tmp2/rahul/gsutil-20-00:00:00z-tprate.gb'.

-----------------------------------------
storage-client – 7.5 minutes
2022-12-04 08:07:27.727 GMT
[licence4.0] Uploading to store for 'gs://$BUCKET/tmp2/rahul/storage-client-20-00:00:00z-tprate.gb'.
2022-12-04 08:14:57.435 GMT
[licence4.0] Upload to store complete for 'gs://$BUCKET/tmp2/rahul/storage-client-20-00:00:00z-tprate.gb'.

-----------------------------------------
shutil – 16.75 minutes
2022-12-04 08:07:18.234 GMT
[licence4.0] Uploading to store for 'gs://$BUCKET/tmp2/rahul/shutil-20-00:00:00z-tprate.gb'.
2022-12-04 08:24:03.314 GMT
[licence4.0] Upload to store complete for 'gs://$BUCKET/tmp2/rahul/shutil-20-00:00:00z-tprate.gb'.
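
For anyone who wants to re-derive the figures, here is a small sketch that recomputes the durations and the reduction relative to shutil from the logged timestamps above. The timestamps are copied from the logs; the dictionary layout and the helper name `minutes` are just for illustration. Results can differ from the quoted numbers in the last digit because of rounding.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# (upload start, upload complete) pairs taken from the log lines above.
RUNS = {
    "gcloud alpha storage": ("2022-12-04 00:46:35.871", "2022-12-04 00:53:04.977"),
    "gsutil": ("2022-12-04 01:24:48.613", "2022-12-04 01:28:37.283"),
    "storage-client": ("2022-12-04 08:07:27.727", "2022-12-04 08:14:57.435"),
    "shutil": ("2022-12-04 08:07:18.234", "2022-12-04 08:24:03.314"),
}


def minutes(start: str, end: str) -> float:
    """Elapsed time between two log timestamps, in minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60


durations = {name: minutes(*span) for name, span in RUNS.items()}
baseline = durations["shutil"]

for name, mins in durations.items():
    reduction = 100 * (1 - mins / baseline)
    print(f"{name}: {mins:.2f} min ({reduction:.0f}% less time than shutil)")
```

Note that the storage-client and shutil timestamps overlap, so those two runs may have competed for bandwidth.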

deepgabani8 pushed a commit that referenced this issue Dec 9, 2022