Question: Is litdata faster when loading a local dataset or an S3 network-storage dataset? #428

Open
2catycm opened this issue Nov 30, 2024 · 6 comments
Labels: enhancement (New feature or request)

Comments


2catycm commented Nov 30, 2024

When my local storage is large enough to download the whole dataset, should I still use litdata's streaming API?
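
For context, a minimal sketch of the local streaming workflow in question (the paths and the load_sample function are placeholders, and it assumes the raw data is first converted with litdata.optimize):

# Minimal sketch: convert a local dataset once, then stream it from local disk.
# All paths and the load_sample function below are placeholders.
import litdata as ld

def load_sample(path):
    # Hypothetical per-sample loader; replace with real decoding logic.
    return {"path": path}

if __name__ == "__main__":
    # One-time conversion of raw files into litdata's chunked format.
    ld.optimize(
        fn=load_sample,
        inputs=["/data/raw/img_0.jpg", "/data/raw/img_1.jpg"],  # placeholder file list
        output_dir="/data/optimized",
        chunk_bytes="64MB",
        num_workers=4,
    )

    # The same StreamingDataset class reads from a local directory or from s3://.
    dataset = ld.StreamingDataset("/data/optimized")
    for sample in dataset:
        pass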

2catycm added the enhancement (New feature or request) label on Nov 30, 2024

Hi! Thanks for your contribution, great first issue!

2catycm (Author) commented Nov 30, 2024

Another question: can I use sshfs instead of S3? I don't have an S3 account, but I have multiple machines. To save storage, I'd like to keep the datasets on machine D and access them from machines A, B, and C. Can I use litdata to optimize this workflow?
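
A sketch of the workaround I have in mind (assuming machine D's storage is mounted via sshfs on each training machine; the hostnames, paths, and loader settings are placeholders):

# On each of machines A/B/C, mount machine D's storage first, e.g.:
#   sshfs user@machine-d:/data/optimized /mnt/machine_d_data -o reconnect
# litdata would then see the mount as an ordinary local directory;
# no S3 credentials or special backend are involved in this workaround.
import litdata as ld

dataset = ld.StreamingDataset("/mnt/machine_d_data")
dataloader = ld.StreamingDataLoader(dataset, batch_size=64, num_workers=4)

for batch in dataloader:
    pass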

tchaton (Collaborator) commented Dec 1, 2024

Hey @2catycm,

Yes, some users have reported increased speed even when running locally.

We don't support sshfs, but it shouldn't be hard to add if you want to. Feel free to make a PR.

Best,
T.C

2catycm (Author) commented Dec 2, 2024


Thanks for your reply. I am trying litdata on the vtab-1k dataset locally. On a subset of 800 samples, iterating with litdata is about 1.41x faster than with a plain PyTorch Dataset (147 ms -> 104 ms).

I am not sure whether my benchmark is appropriate, since I only iterate over the dataset trivially and haven't used it for training.

%%timeit
# Iterate once over the dataset to measure pure loading speed
# (tqdm and train_dataset are defined in earlier cells).
for data in tqdm(train_dataset):
    pass

tchaton (Collaborator) commented Dec 2, 2024

Hey @2catycm, yes, this is appropriate. We benchmark by iterating over the dataset for two epochs in the cloud and one epoch locally.
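
Roughly, that kind of measurement looks like the sketch below (the dataset path and loader settings are placeholders):

import time
import litdata as ld

# Sketch of a multi-epoch iteration benchmark over an optimized dataset.
dataset = ld.StreamingDataset("/data/optimized")
dataloader = ld.StreamingDataLoader(dataset, batch_size=64, num_workers=4)

for epoch in range(2):
    start = time.perf_counter()
    for batch in dataloader:
        pass
    print(f"epoch {epoch}: {time.perf_counter() - start:.2f}s")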

tchaton (Collaborator) commented Dec 2, 2024

Hey @2catycm. We could probably make it slightly faster too.
