show_progress_bar estimated time not accurate when dataset has texts of varying lengths #2940

cdfox · 2024-09-16T17:49:46Z

Because encode() sorts the sentences by length in decreasing order, the longest texts are processed first. When I pass in a large list of texts whose lengths vary widely (I'm using a model that supports a max_seq_length of up to 8192, but most of the texts are much shorter than that), tqdm's initial estimate of the time remaining is far too high (e.g., 4 hours instead of 30 minutes). Perhaps when show_progress_bar=True, it would be good to print a warning that the time remaining may be overestimated because of the sort order.
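To illustrate why the early estimate overshoots, here is a rough simulation in plain Python. It is a sketch under assumptions, not the library's actual code: the cost model (each batch costs its longest text times its batch size, as if padded), the helper names `batch_costs` and `naive_eta_total`, and the synthetic length distribution are all made up for illustration. The projection is a simplified average-rate estimate; tqdm's real estimator also applies smoothing.

```python
import random


def batch_costs(lengths, batch_size):
    """Simulated cost per batch when texts are sorted longest-first.

    Assumed cost model: each batch is padded to its longest text, so its
    cost is (longest length in the batch) * (number of texts in the batch).
    """
    ordered = sorted(lengths, reverse=True)
    costs = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        costs.append(max(batch) * len(batch))
    return costs


def naive_eta_total(costs, batches_done):
    """Projected total cost from the average cost of the batches seen so far."""
    elapsed = sum(costs[:batches_done])
    return elapsed / batches_done * len(costs)


random.seed(0)
# Hypothetical dataset: mostly short texts, a few close to the 8192 limit.
lengths = [random.randint(20, 200) for _ in range(9_500)]
lengths += [random.randint(4_000, 8_192) for _ in range(500)]

costs = batch_costs(lengths, batch_size=32)
actual = sum(costs)
projected = naive_eta_total(costs, batches_done=5)
print(f"projected / actual after 5 batches: {projected / actual:.1f}x")
```

Under this model the first batches contain only the longest texts, so any estimate extrapolated from them assumes every remaining batch is equally expensive. The ratio shrinks toward 1 as shorter batches are processed.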
