Question: Is litdata faster when loading a local dataset or an S3 network-storage dataset? #428

Open
2catycm opened this issue Nov 30, 2024 · 6 comments
Labels: enhancement (New feature or request)

Comments


2catycm commented Nov 30, 2024

When my local storage is large enough to download the whole dataset, should I still use litdata's streaming API?
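
For context, a minimal sketch of the local streaming workflow in question (the paths and the load_sample function are placeholders, and it assumes the raw data is first converted with litdata.optimize):

# Minimal sketch: convert a local dataset once, then stream it from local disk.
# All paths and the load_sample function below are placeholders.
import litdata as ld

def load_sample(path):
    # Hypothetical per-sample loader; replace with real decoding logic.
    return {"path": path}

if __name__ == "__main__":
    # One-time conversion of raw files into litdata's chunked format.
    ld.optimize(
        fn=load_sample,
        inputs=["/data/raw/img_0.jpg", "/data/raw/img_1.jpg"],  # placeholder file list
        output_dir="/data/optimized",
        chunk_bytes="64MB",
        num_workers=4,
    )

    # The same StreamingDataset class reads from a local directory or from s3://.
    dataset = ld.StreamingDataset("/data/optimized")
    for sample in dataset:
        pass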

2catycm added the enhancement (New feature or request) label on Nov 30, 2024

Hi! Thanks for your contribution, great first issue!

2catycm (Author) commented Nov 30, 2024

Another question: can I use sshfs instead of S3? I don't have an S3 account, but I have multiple machines. To save storage, I'd like to keep the datasets on machine D and access them from machines A, B, and C. Can I use litdata to optimize this workflow?
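
A sketch of the workaround I have in mind (assuming machine D's storage is mounted via sshfs on each training machine; the hostnames, paths, and loader settings are placeholders):

# On each of machines A/B/C, mount machine D's storage first, e.g.:
#   sshfs user@machine-d:/data/optimized /mnt/machine_d_data -o reconnect
# litdata would then see the mount as an ordinary local directory;
# no S3 credentials or special backend are involved in this workaround.
import litdata as ld

dataset = ld.StreamingDataset("/mnt/machine_d_data")
dataloader = ld.StreamingDataLoader(dataset, batch_size=64, num_workers=4)

for batch in dataloader:
    pass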

tchaton (Collaborator) commented Dec 1, 2024

Hey @2catycm,

Yes, some users have reported increased speed even when running locally.

We don't support sshfs, but it shouldn't be hard to add if you want to. Feel free to make a PR.

Best,
T.C

2catycm (Author) commented Dec 2, 2024


Thanks for your reply. I am trying litdata on the vtab-1k dataset locally. On a subset of 800 samples, iterating with litdata is about 1.41x faster than with a plain PyTorch Dataset (147 ms -> 104 ms).

I am not sure whether my benchmark is appropriate, since I only iterate over the dataset trivially and haven't used it for training.

%%timeit
# Iterate once over the dataset to measure pure loading speed
# (tqdm and train_dataset are defined in earlier cells).
for data in tqdm(train_dataset):
    pass

tchaton (Collaborator) commented Dec 2, 2024

Hey @2catycm, yes, this is appropriate. We benchmark by iterating over the dataset for two epochs in the cloud and one epoch locally.
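
Roughly, that kind of measurement looks like the sketch below (the dataset path and loader settings are placeholders):

import time
import litdata as ld

# Sketch of a multi-epoch iteration benchmark over an optimized dataset.
dataset = ld.StreamingDataset("/data/optimized")
dataloader = ld.StreamingDataLoader(dataset, batch_size=64, num_workers=4)

for epoch in range(2):
    start = time.perf_counter()
    for batch in dataloader:
        pass
    print(f"epoch {epoch}: {time.perf_counter() - start:.2f}s")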

tchaton (Collaborator) commented Dec 2, 2024

Hey @2catycm. We could probably make it slightly faster too.
