Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCS Folder Fix #656

Merged
merged 2 commits into from
Sep 13, 2023
Merged

GCS Folder Fix #656

merged 2 commits into from
Sep 13, 2023

Conversation

varunshenoy
Copy link
Contributor

@varunshenoy varunshenoy commented Sep 13, 2023

A quick fix to ensure that hf_cache works with gcs folders, e.g. gcs://llama-2-7b/folder_a/folder_b etc. The main change is using the split_gcs_logic from serving_image_builder.py in cache_warmer.py.

(truss-py3.9) @varunshenoy ➜ /workspaces/truss (varun/gcs-folder-fix) $ truss image build  examples/working-vllm-gcs/
[+] Building 37.9s (57/57) FINISHED                                                                                                                 docker:default
 => [internal] load .dockerignore                                                                                                                             0.2s
 => => transferring context: 2B                                                                                                                               0.0s
 => [internal] load build definition from Dockerfile                                                                                                          0.2s
 => => transferring dockerfile: 5.27kB                                                                                                                        0.0s
 => [internal] load metadata for docker.io/baseten/vllm:v0.3                                                                                                  0.4s
 => [internal] load metadata for docker.io/library/python:3.11-slim                                                                                           0.4s
 => [internal] load build context                                                                                                                             0.1s
 => => transferring context: 6.92kB                                                                                                                           0.0s
 => [vllm  1/24] FROM docker.io/baseten/vllm:v0.3@sha256:999d3ec4405c02227df87dd4a57b2a9b9494ff7c0116eb13ebe8141916c9b1c0                                     0.0s
 => [cache_warmer  1/27] FROM docker.io/library/python:3.11-slim@sha256:9bd704d713fde6cebdd54779c121da9c3ddd28808797e4f93d58af0050e8ba70                      0.0s
 => CACHED [vllm  2/24] RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends         curl nginx supervisor &&      0.0s
 => CACHED [vllm  3/24] COPY ./proxy.conf /etc/nginx/conf.d/proxy.conf                                                                                        0.0s
 => CACHED [vllm  4/24] RUN mkdir -p /var/log/supervisor                                                                                                      0.0s
 => CACHED [cache_warmer  2/27] RUN mkdir -p /app/hf_cache                                                                                                    0.0s
 => CACHED [cache_warmer  3/27] WORKDIR /app                                                                                                                  0.0s
 => CACHED [cache_warmer  4/27] COPY ./data/service_account.json /app/data/service_account.json                                                               0.0s
 => CACHED [cache_warmer  5/27] RUN apt-get -y update; apt-get -y install curl; curl -s https://baseten-public.s3.us-west-2.amazonaws.com/bin/b10cp-5fe8dc7d  0.0s
 => CACHED [cache_warmer  6/27] COPY ./cache_requirements.txt /app/cache_requirements.txt                                                                     0.0s
 => CACHED [cache_warmer  7/27] RUN pip install -r /app/cache_requirements.txt --no-cache-dir && rm -rf /root/.cache/pip                                      0.0s
 => CACHED [cache_warmer  8/27] COPY ./cache_warmer.py /cache_warmer.py                                                                                       0.0s
 => [vllm  5/24] COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf                                                                                0.2s
 => [cache_warmer  9/27] RUN python3 /cache_warmer.py folder_a/258661883-4285d010-845c-4540-878f-f87c31702a69.png gs://llama-2-7b/folder_a                    1.2s
 => [cache_warmer 10/27] RUN python3 /cache_warmer.py folder_a/6257300d3296d266ba1bde0d_BasetenLogoCentered.png gs://llama-2-7b/folder_a                      1.0s
 => [cache_warmer 11/27] RUN python3 /cache_warmer.py folder_a/baseten/0.png gs://llama-2-7b/folder_a                                                         1.0s
 => [cache_warmer 12/27] RUN python3 /cache_warmer.py folder_a/baseten/1.png gs://llama-2-7b/folder_a                                                         1.0s
 => [cache_warmer 13/27] RUN python3 /cache_warmer.py folder_a/baseten/10.png gs://llama-2-7b/folder_a                                                        1.1s
 => [cache_warmer 14/27] RUN python3 /cache_warmer.py folder_a/baseten/11.png gs://llama-2-7b/folder_a                                                        1.0s
 => [cache_warmer 15/27] RUN python3 /cache_warmer.py folder_a/baseten/12.png gs://llama-2-7b/folder_a                                                        1.0s
 => [cache_warmer 16/27] RUN python3 /cache_warmer.py folder_a/baseten/13.png gs://llama-2-7b/folder_a                                                        1.0s
 => [cache_warmer 17/27] RUN python3 /cache_warmer.py folder_a/baseten/14.png gs://llama-2-7b/folder_a                                                        1.1s
 => [cache_warmer 18/27] RUN python3 /cache_warmer.py folder_a/baseten/2.png gs://llama-2-7b/folder_a                                                         1.0s
 => [cache_warmer 19/27] RUN python3 /cache_warmer.py folder_a/baseten/3.png gs://llama-2-7b/folder_a                                                         1.1s
 => [cache_warmer 20/27] RUN python3 /cache_warmer.py folder_a/baseten/4.png gs://llama-2-7b/folder_a                                                         1.1s
 => [cache_warmer 21/27] RUN python3 /cache_warmer.py folder_a/baseten/5.png gs://llama-2-7b/folder_a                                                         1.0s
 => [cache_warmer 22/27] RUN python3 /cache_warmer.py folder_a/baseten/6.png gs://llama-2-7b/folder_a                                                         1.0s
 => [cache_warmer 23/27] RUN python3 /cache_warmer.py folder_a/baseten/7.png gs://llama-2-7b/folder_a                                                         1.1s
 => [cache_warmer 24/27] RUN python3 /cache_warmer.py folder_a/baseten/8.png gs://llama-2-7b/folder_a                                                         1.0s
 => [cache_warmer 25/27] RUN python3 /cache_warmer.py folder_a/baseten/9.png gs://llama-2-7b/folder_a                                                         1.0s
 => [cache_warmer 26/27] RUN python3 /cache_warmer.py folder_a/baseten/baseten.png gs://llama-2-7b/folder_a                                                   1.0s
 => [cache_warmer 27/27] RUN python3 /cache_warmer.py folder_a/baseten/output.mp4 gs://llama-2-7b/folder_a                                                    1.0s
 => [vllm  6/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/258661883-4285d010-845c-4540-878f-f87c31702a69.png /app/hf_cache/llama-2-7b/fold  0.3s
 => [vllm  7/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/6257300d3296d266ba1bde0d_BasetenLogoCentered.png /app/hf_cache/llama-2-7b/folder  0.2s
 => [vllm  8/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/0.png /app/hf_cache/llama-2-7b/folder_a/baseten/0.png                     0.2s
 => [vllm  9/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/1.png /app/hf_cache/llama-2-7b/folder_a/baseten/1.png                     0.2s
 => [vllm 10/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/10.png /app/hf_cache/llama-2-7b/folder_a/baseten/10.png                   0.2s
 => [vllm 11/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/11.png /app/hf_cache/llama-2-7b/folder_a/baseten/11.png                   0.2s
 => [vllm 12/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/12.png /app/hf_cache/llama-2-7b/folder_a/baseten/12.png                   0.2s
 => [vllm 13/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/13.png /app/hf_cache/llama-2-7b/folder_a/baseten/13.png                   0.2s
 => [vllm 14/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/14.png /app/hf_cache/llama-2-7b/folder_a/baseten/14.png                   0.2s
 => [vllm 15/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/2.png /app/hf_cache/llama-2-7b/folder_a/baseten/2.png                     0.2s
 => [vllm 16/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/3.png /app/hf_cache/llama-2-7b/folder_a/baseten/3.png                     0.2s
 => [vllm 17/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/4.png /app/hf_cache/llama-2-7b/folder_a/baseten/4.png                     0.2s
 => [vllm 18/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/5.png /app/hf_cache/llama-2-7b/folder_a/baseten/5.png                     0.2s
 => [vllm 19/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/6.png /app/hf_cache/llama-2-7b/folder_a/baseten/6.png                     0.2s
 => [vllm 20/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/7.png /app/hf_cache/llama-2-7b/folder_a/baseten/7.png                     0.2s
 => [vllm 21/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/8.png /app/hf_cache/llama-2-7b/folder_a/baseten/8.png                     0.2s
 => [vllm 22/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/9.png /app/hf_cache/llama-2-7b/folder_a/baseten/9.png                     0.2s
 => [vllm 23/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/baseten.png /app/hf_cache/llama-2-7b/folder_a/baseten/baseten.png         0.2s
 => [vllm 24/24] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/output.mp4 /app/hf_cache/llama-2-7b/folder_a/baseten/output.mp4           0.2s
 => exporting to image                                                                                                                                       12.4s
 => => exporting layers                                                                                                                                      12.4s
 => => writing image sha256:69c38bafc11426903f4038ec32ab639fd6f23e411913ea386a894edd00378ab7                                                                  0.0s
 => => naming to docker.io/library/custom-model:latest               

@varunshenoy varunshenoy requested a review from bolasim September 13, 2023 18:31
@bolasim bolasim requested a review from squidarth September 13, 2023 18:34
Copy link
Collaborator

@bolasim bolasim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the output of the build to the description of the PR?

repo_name = repo_name.replace("gs://", "")
cache_dir = Path(f"/app/hf_cache/{repo_name}")
cache_dir = Path(f"/app/hf_cache/{bucket_name}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem with doing this is that if you add multiple repo_id entries with the same bucket but different prefixes, they'll overwrite each other. Might be worth thinking about in a followup, but lets get this out to fix the issue now though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. Different prefixes will fall under different folders in the same bucket. I can test this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[+] Building 38.9s (63/63) FINISHED                                                                                                                 docker:default
 => [internal] load build definition from Dockerfile                                                                                                          0.2s
 => => transferring dockerfile: 5.68kB                                                                                                                        0.0s
 => [internal] load .dockerignore                                                                                                                             0.2s
 => => transferring context: 2B                                                                                                                               0.0s
 => [internal] load metadata for docker.io/baseten/vllm:v0.3                                                                                                  0.7s
 => [internal] load metadata for docker.io/library/python:3.11-slim                                                                                           0.6s
 => [auth] baseten/vllm:pull token for registry-1.docker.io                                                                                                   0.0s
 => [auth] library/python:pull token for registry-1.docker.io                                                                                                 0.0s
 => [internal] load build context                                                                                                                             0.1s
 => => transferring context: 6.92kB                                                                                                                           0.0s
 => [vllm  1/26] FROM docker.io/baseten/vllm:v0.3@sha256:999d3ec4405c02227df87dd4a57b2a9b9494ff7c0116eb13ebe8141916c9b1c0                                     0.0s
 => [cache_warmer  1/29] FROM docker.io/library/python:3.11-slim@sha256:9bd704d713fde6cebdd54779c121da9c3ddd28808797e4f93d58af0050e8ba70                      0.0s
 => CACHED [vllm  2/26] RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends         curl nginx supervisor &&      0.0s
 => CACHED [vllm  3/26] COPY ./proxy.conf /etc/nginx/conf.d/proxy.conf                                                                                        0.0s
 => CACHED [vllm  4/26] RUN mkdir -p /var/log/supervisor                                                                                                      0.0s
 => CACHED [vllm  5/26] COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf                                                                         0.0s
 => CACHED [cache_warmer  2/29] RUN mkdir -p /app/hf_cache                                                                                                    0.0s
 => CACHED [cache_warmer  3/29] WORKDIR /app                                                                                                                  0.0s
 => CACHED [cache_warmer  4/29] COPY ./data/service_account.json /app/data/service_account.json                                                               0.0s
 => CACHED [cache_warmer  5/29] RUN apt-get -y update; apt-get -y install curl; curl -s https://baseten-public.s3.us-west-2.amazonaws.com/bin/b10cp-5fe8dc7d  0.0s
 => CACHED [cache_warmer  6/29] COPY ./cache_requirements.txt /app/cache_requirements.txt                                                                     0.0s
 => CACHED [cache_warmer  7/29] RUN pip install -r /app/cache_requirements.txt --no-cache-dir && rm -rf /root/.cache/pip                                      0.0s
 => CACHED [cache_warmer  8/29] COPY ./cache_warmer.py /cache_warmer.py                                                                                       0.0s
 => [cache_warmer  9/29] RUN python3 /cache_warmer.py folder_b/image_2.png gs://llama-2-7b/folder_b                                                           1.1s
 => [cache_warmer 10/29] RUN python3 /cache_warmer.py folder_b/image_3.png gs://llama-2-7b/folder_b                                                           1.0s
 => [cache_warmer 11/29] RUN python3 /cache_warmer.py folder_a/258661883-4285d010-845c-4540-878f-f87c31702a69.png gs://llama-2-7b/folder_a                    1.1s
 => [cache_warmer 12/29] RUN python3 /cache_warmer.py folder_a/6257300d3296d266ba1bde0d_BasetenLogoCentered.png gs://llama-2-7b/folder_a                      1.2s
 => [cache_warmer 13/29] RUN python3 /cache_warmer.py folder_a/baseten/0.png gs://llama-2-7b/folder_a                                                         1.1s
 => [cache_warmer 14/29] RUN python3 /cache_warmer.py folder_a/baseten/1.png gs://llama-2-7b/folder_a                                                         1.1s
 => [cache_warmer 15/29] RUN python3 /cache_warmer.py folder_a/baseten/10.png gs://llama-2-7b/folder_a                                                        1.0s
 => [cache_warmer 16/29] RUN python3 /cache_warmer.py folder_a/baseten/11.png gs://llama-2-7b/folder_a                                                        1.0s
 => [cache_warmer 17/29] RUN python3 /cache_warmer.py folder_a/baseten/12.png gs://llama-2-7b/folder_a                                                        1.1s
 => [cache_warmer 18/29] RUN python3 /cache_warmer.py folder_a/baseten/13.png gs://llama-2-7b/folder_a                                                        1.0s
 => [cache_warmer 19/29] RUN python3 /cache_warmer.py folder_a/baseten/14.png gs://llama-2-7b/folder_a                                                        1.0s
 => [cache_warmer 20/29] RUN python3 /cache_warmer.py folder_a/baseten/2.png gs://llama-2-7b/folder_a                                                         1.0s
 => [cache_warmer 21/29] RUN python3 /cache_warmer.py folder_a/baseten/3.png gs://llama-2-7b/folder_a                                                         1.0s
 => [cache_warmer 22/29] RUN python3 /cache_warmer.py folder_a/baseten/4.png gs://llama-2-7b/folder_a                                                         1.0s
 => [cache_warmer 23/29] RUN python3 /cache_warmer.py folder_a/baseten/5.png gs://llama-2-7b/folder_a                                                         1.1s
 => [cache_warmer 24/29] RUN python3 /cache_warmer.py folder_a/baseten/6.png gs://llama-2-7b/folder_a                                                         1.1s
 => [cache_warmer 25/29] RUN python3 /cache_warmer.py folder_a/baseten/7.png gs://llama-2-7b/folder_a                                                         1.0s
 => [cache_warmer 26/29] RUN python3 /cache_warmer.py folder_a/baseten/8.png gs://llama-2-7b/folder_a                                                         1.0s
 => [cache_warmer 27/29] RUN python3 /cache_warmer.py folder_a/baseten/9.png gs://llama-2-7b/folder_a                                                         1.1s
 => [cache_warmer 28/29] RUN python3 /cache_warmer.py folder_a/baseten/baseten.png gs://llama-2-7b/folder_a                                                   1.1s
 => [cache_warmer 29/29] RUN python3 /cache_warmer.py folder_a/baseten/output.mp4 gs://llama-2-7b/folder_a                                                    1.0s
 => [vllm  6/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_b/image_2.png /app/hf_cache/llama-2-7b/folder_b/image_2.png                         0.2s
 => [vllm  7/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_b/image_3.png /app/hf_cache/llama-2-7b/folder_b/image_3.png                         0.3s
 => [vllm  8/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/258661883-4285d010-845c-4540-878f-f87c31702a69.png /app/hf_cache/llama-2-7b/fold  0.2s
 => [vllm  9/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/6257300d3296d266ba1bde0d_BasetenLogoCentered.png /app/hf_cache/llama-2-7b/folder  0.2s
 => [vllm 10/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/0.png /app/hf_cache/llama-2-7b/folder_a/baseten/0.png                     0.2s
 => [vllm 11/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/1.png /app/hf_cache/llama-2-7b/folder_a/baseten/1.png                     0.2s
 => [vllm 12/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/10.png /app/hf_cache/llama-2-7b/folder_a/baseten/10.png                   0.2s
 => [vllm 13/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/11.png /app/hf_cache/llama-2-7b/folder_a/baseten/11.png                   0.2s
 => [vllm 14/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/12.png /app/hf_cache/llama-2-7b/folder_a/baseten/12.png                   0.2s
 => [vllm 15/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/13.png /app/hf_cache/llama-2-7b/folder_a/baseten/13.png                   0.2s
 => [vllm 16/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/14.png /app/hf_cache/llama-2-7b/folder_a/baseten/14.png                   0.2s
 => [vllm 17/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/2.png /app/hf_cache/llama-2-7b/folder_a/baseten/2.png                     0.2s
 => [vllm 18/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/3.png /app/hf_cache/llama-2-7b/folder_a/baseten/3.png                     0.2s
 => [vllm 19/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/4.png /app/hf_cache/llama-2-7b/folder_a/baseten/4.png                     0.2s
 => [vllm 20/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/5.png /app/hf_cache/llama-2-7b/folder_a/baseten/5.png                     0.2s
 => [vllm 21/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/6.png /app/hf_cache/llama-2-7b/folder_a/baseten/6.png                     0.2s
 => [vllm 22/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/7.png /app/hf_cache/llama-2-7b/folder_a/baseten/7.png                     0.2s
 => [vllm 23/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/8.png /app/hf_cache/llama-2-7b/folder_a/baseten/8.png                     0.2s
 => [vllm 24/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/9.png /app/hf_cache/llama-2-7b/folder_a/baseten/9.png                     0.2s
 => [vllm 25/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/baseten.png /app/hf_cache/llama-2-7b/folder_a/baseten/baseten.png         0.2s
 => [vllm 26/26] COPY --from=cache_warmer /app/hf_cache/llama-2-7b/folder_a/baseten/output.mp4 /app/hf_cache/llama-2-7b/folder_a/baseten/output.mp4           0.2s
 => exporting to image                                                                                                                                       10.4s
 => => exporting layers                                                                                                                                      10.4s
 => => writing image sha256:74bcf30bdf2fa389257c20f6901a4fbe8723e4825231b5f70cb132318ce48ce9                                                                  0.0s
 => => naming to docker.io/library/custom-model:latest        

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works with multiple folders:

  • gs://llama-2-7b/folder_a
  • gs://llama-2-7b/folder_b

@@ -114,6 +114,9 @@ def list_bucket_files(bucket_name, data_dir, is_trusted=False):

all_objects = []
for blob in blobs:
# leave out folders
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? Why shouldn't I put a shorter prefix and include nested folders?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The folders will be included anyways. If you have folder_a/ with an item folder_a/image.png, you can just copy over folder_a/image.png instead of copying both.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a comment at the top of this function giving an example of the output would be helpful to help the reader understand why leaving out folders is safe

def download_file(
repo_name, file_name, revision_name=None, key_file="/app/data/service_account.json"
):
# Check if repo_name starts with "gs://"
if "gs://" in repo_name:
# Create directory if not exist
bucket_name, _ = split_gs_path(repo_name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder if it's better to strip the repo_name when templating out the command we're passing in instead of parsing it differently here.

@varunshenoy varunshenoy merged commit 39a790d into main Sep 13, 2023
10 checks passed
@varunshenoy varunshenoy deleted the varun/gcs-folder-fix branch September 13, 2023 20:12
Copy link
Collaborator

@squidarth squidarth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple comments around readability

@@ -32,21 +32,36 @@ def _download_from_url_using_b10cp(
)


def split_gs_path(gs_path):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider organizing the code a little differently:

class Repo:
    @staticmethod
    def from_string(repo_id: str) -> "Repo":
         if gsc, return GcsRepo(), else if hugging face return HuggingFaceRepo()


class GcsRepo(Repo):
    bucket: str
    prefix: str

class HuggingFaceRepo(Repo):
    ...

@@ -114,6 +114,9 @@ def list_bucket_files(bucket_name, data_dir, is_trusted=False):

all_objects = []
for blob in blobs:
# leave out folders
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a comment at the top of this function giving an example of the output would be helpful to help the reader understand why leaving out folders is safe

@@ -114,6 +114,9 @@ def list_bucket_files(bucket_name, data_dir, is_trusted=False):

all_objects = []
for blob in blobs:
# leave out folders
if blob.name[-1] == "/":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using pathlib.Path to check whether this is a folder or not, I think it would help the readability here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants