Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate block fetch requests. #1733

Merged
merged 8 commits into from
Oct 28, 2024
Merged

Conversation

groutr
Copy link
Contributor

@groutr groutr commented Oct 22, 2024

Implement the TODO to consolidate consecutive block fetch requests into a single fetch request.

Adapted from the itertools example found here: https://docs.python.org/2.6/library/itertools.html#examples

@martindurant
Copy link
Member

Would you mind explaining more completely what this does? It also makes that block of code look more complex (to me), so some comments inline would be appreciated.

@groutr
Copy link
Contributor Author

groutr commented Oct 22, 2024

Python's groupby works by splitting based on a change in the value of a key function. The chosen key computes the difference between an ascending range and the block numbers (which are also necessarily sorted). If the that difference changes, we know that there was a "hole" in the block numbers (ie, we already have the block cached).

Assume a file of 5 blocks, [0, 1, 2, 3, 4] where we have already cached block 2. The caller requests a byte range that covers blocks 0, 1, 3.

need = (i for i in block_range if i not in self.blocks)   # lazily evaluated to [0, 1, 3]
groupby(enumerate(need), key=lambda x: x[0]-x[1])
    Enumerating need pairs the needed blocks with an ascending range
    (0, 0), (1, 1), (2, 3)  # Note that the actual block numbers are the second values.
    Applying the key we get values: 0, 0, 1
    Groupby splits a new group on each key value change so we end up with [(0, 0), (1, 1)] and [(2, 3)] groups.
Since we enumerated need, we need to extract the second values from each group to get our blocks, 
so we map itemgetter(1) over the group.
For the first group, we have blocks [0, 1], so we compute the offset for beginning of block 0 and the end of block 1.
One call is made to fetch for both blocks 0 and 1.

The second group is just a single block, 3. One call is made to fetch for that block.

I'll add an explanation via comments.

@martindurant
Copy link
Member

You might be interested, that something similar happens in https://github.com/fsspec/filesystem_spec/blob/master/fsspec/utils.py#L528 (where we don't have blocks, but we do have byte ranges and possible gaps).

@groutr
Copy link
Contributor Author

groutr commented Oct 22, 2024

@martindurant Thank you for pointing my attention to that function. I'm not sure if the approach here is directly applicable there. I'll have to spend more time reasoning thru it.

@groutr
Copy link
Contributor Author

groutr commented Oct 24, 2024

@martindurant Would you prefer that I instead use merge_offset_ranges in this PR?

@martindurant
Copy link
Member

Would you prefer that I instead use merge_offset_ranges

Only if it's obvious how. I am not in a good position to make that choice, so long as you have considered the possibility.

@groutr
Copy link
Contributor Author

groutr commented Oct 25, 2024

@martindurant I played around with an implementation and observed that merge_offset_ranges does not sufficiently preserve enough info to adequately update the cache metadata. Once the byte ranges are merged, we lose the info about which blocks those ranges represent and it becomes much harder to update the cache metadata.

@martindurant martindurant merged commit 61b66b6 into fsspec:master Oct 28, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants