Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataPipe] key renamer #402

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

[DataPipe] key renamer #402

wants to merge 5 commits into from

Conversation

tmbdev
Copy link
Contributor

@tmbdev tmbdev commented May 13, 2022

This PR adds a filter that allows keys to be renamed in training samples represented as dictionaries. This is particularly useful for webdataset-style data sets, but can also be used with other dictionary iterators.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 13, 2022
Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the implementation makes sense but it feels slightly specific. I will let others chime in on their thoughts. Aside from that, just a few nit comments:

It is worth mentioning that we generally name our DataPipe following this convention:
https://github.com/pytorch/data/blob/9ad73d85b398828a919aef692c27f77182716aa1/docs/source/tutorial.rst#naming

I should add that to the contributing guide.

torchdata/datapipes/iter/util/renamekeys.py Show resolved Hide resolved
torchdata/datapipes/iter/util/renamekeys.py Outdated Show resolved Hide resolved
@NivekT NivekT changed the title key renamer [DataPipe] key renamer May 19, 2022
torchdata/datapipes/iter/util/renamekeys.py Show resolved Hide resolved
test/test_iterdatapipe.py Outdated Show resolved Hide resolved
output = list(iter(stage2))
assert len(output) == 2
assert set(output[0].keys()) == set(["t", "b"])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please test other boolean flags.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@VitalyFedyunin
Copy link
Contributor

Please switch the order of inputs 'pattern' -> 'new name' looks more natural

@tmbdev
Copy link
Contributor Author

tmbdev commented May 20, 2022

The usual usage is with keyword arguments using a simple key as output and a pattern as input. It also parallels assignment. I think this order is more useful. What do you think?

@VitalyFedyunin
Copy link
Contributor

In my opinion it makes sense to have two datapipes:
pattern_filter_keys -> takes patterns, throws away all missmatch keys #406
and
pattern_rename_keys -> takes pattern->new_name dictionary and renames keys accordingly. In this case they will follow same API patterns and would be easy to remember.


def __init__(
self,
source_datapipe: IterDataPipe[List[Union[Dict, List]]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source_datapipe: IterDataPipe[Dict[Any, Any]]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function isn't primarily as a general stream transformation, but as a quick, simple, and readable way of extracting fields for further processing in a data pipeline. That is, usually this is used for getting tar files with different file name patterns and The current API addresses that use case really well. I would recommend keeping it as it is.

Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, it looks pretty good! Just a few comments. Thanks for the PR!

Comment on lines +27 to +29
keep_unselected: keep keys/value pairs even if they don't match any pattern (False)
must_match: all key value pairs must match (True)
duplicate_is_error: it is an error if two renamings yield the same key (True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Should we move these after *args?

WebDatasetIterDataPipe as WebDataset,
)
from torchdata.datapipes.iter.util.renamekeys import (
KeyRenamerIterDataPipe as RenameKeys,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
KeyRenamerIterDataPipe as RenameKeys,
KeyRenamerIterDataPipe as KeyRenamer,

*args,
keep_unselected=False,
must_match=True,
duplicate_is_error=True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
duplicate_is_error=True,
allow_duplicate=False,

nit: might be a better name but feel free to ignore

source_datapipe: a DataPipe yielding a stream of dictionaries.
keep_unselected: keep keys/value pairs even if they don't match any pattern (False)
must_match: all key value pairs must match (True)
duplicate_is_error: it is an error if two renamings yield the same key (True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
duplicate_is_error: it is an error if two renamings yield the same key (True)
duplicate_is_error: it is an error if two renamings yield the same key (True); otherwise the first matched one will be returned

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants