Early fusion multimodal models #1904
Conversation
Thanks for putting this up Rafi! I left some comments on the implementation, but I'll leave the state dict discussion to others as we've already chatted on this.
Thanks for the RFC, you made it very clear what the difference is between early fusion and late fusion! About the design choice, I personally prefer Option 2 for the same reason you mentioned. I think it's fine to "pollute" the decoder model forward a bit with some optional arguments for each modality. We might need something like
11th hour comment on the open design question: in my mind there are nonzero UX costs to either approach. Personally I really don't like state dict hooks: as soon as something (inevitably) goes wrong, it will take a lot more debugging and head-banging-against-the-wall before the user realizes that things are being swapped out under the hood. So perhaps it's no surprise, but I vote for the simple and dumb thing: just add an extra parameter to TransformerDecoder forward. I know that may be controversial, but I like doing the obvious thing, and I like to think our users would appreciate that as well.
After extensive discussion offline, we decided to move ahead with the state dict hook approach. All current changes reflect this.
What were the high-level reasons for this, if I may ask?
```python
if len(encoders.keys()) != 1:
    raise ValueError(
        f"DeepFusionModel only supports a single encoder. Got {len(encoders.keys())} encoders."
    )
```
Just wondering: why do we generalize encoder -> encoders now if we aren't ready to support multiple encoders yet anyways? Seems to me it'd be better to just make that move all at once in a separate PR. I would think we're not strictly required to have matching signatures for DeepFusion and EarlyFusion classes, is that incorrect?
It was mainly to maintain a consistent API between the two, but I don't have strong opinions here. I don't see any hard requirement to make the signatures match.
```python
>>> # Load full fused checkpoints
>>> model.load_state_dict(...)
```
nit: I'm not sure this is especially helpful (maybe I'm missing the point though)
If the checkpoint is a single file, you can load the entire model, encoders and all, at once. But I'm not sure if this will be the case for a model with multiple encoders, or what the checkpoint UX would look like. I'm ok to remove this until we know for sure.
```python
# [bsz, seq_len, 1]
encoder_mask = (tokens == self.encoder_tokens[encoder]).unsqueeze(-1)
# At locations where encoder token is found, replace with encoder embedding
fused_embeds = fused_embeds.masked_scatter(encoder_mask, encoder_embeds)
```
Sorry, I might be missing the point here... is this changing the shape of fused_embeds?
It is not; it is just placing the encoder embedding vectors at each instance of that encoder token in fused_embeds. The embeddings should have the same hidden dim, but num encoder embeddings < num fused embeds.
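To make the shape-preserving behavior concrete, here is a small standalone illustration of the masked_scatter pattern; the tensor sizes and the image token id are made up for the example and are not from the PR:

```python
import torch

bsz, seq_len, embed_dim = 2, 6, 4
image_token_id = 32000  # hypothetical special token id

tokens = torch.randint(0, 100, (bsz, seq_len))
tokens[0, 2] = image_token_id  # one image token in sample 0
tokens[1, 1] = image_token_id  # two image tokens in sample 1
tokens[1, 4] = image_token_id

fused_embeds = torch.randn(bsz, seq_len, embed_dim)  # text embeddings for every position
encoder_embeds = torch.randn(3, embed_dim)           # one embedding row per image token

# [bsz, seq_len, 1] mask of image-token positions
encoder_mask = (tokens == image_token_id).unsqueeze(-1)
out = fused_embeds.masked_scatter(encoder_mask, encoder_embeds)

assert out.shape == fused_embeds.shape            # shape is unchanged
assert torch.equal(out[0, 2], encoder_embeds[0])  # encoder rows are filled in scan order
assert torch.equal(out[1, 4], encoder_embeds[2])
```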
```python
# [bsz * num_encoder_tokens, embed_dim]
encoder_embeds = encoder_embeds.view(-1, embed_dim)
# [bsz, seq_len, 1]
encoder_mask = (tokens == self.encoder_tokens[encoder]).unsqueeze(-1)
```
Do we need to do any validation on encoder_tokens in the model? E.g. what if we have image embeddings but there is no image token in the token sequence? Do we expect that to be handled in the dataset? If so, we should probably call it out in the documentation somewhere.
Good point, I need to think about where this should be asserted. I would say probably in the transform or the dataset. We wouldn't want to forward pass the encoder if there's nowhere to use it. Although, within a batch you can have a variable number of images per sample, so one sample may have zero images and another may have two.
I've added a ValueError just in case.
Actually... sometimes the number of encoder embeddings and the number of encoder tokens will not match, because there could be padding images, so the dataset is probably the best place to assert this. I will update the docstring here to call this out, though.
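For illustration, a minimal sketch of the kind of per-sample check discussed here, assuming it runs in the transform/dataset before any padding images are added; the function and argument names are hypothetical, not torchtune's actual API:

```python
def validate_image_tokens(tokens: list[int], num_images: int, image_token_id: int) -> None:
    """Raise if the number of images and image special tokens in a sample disagree."""
    num_image_tokens = sum(t == image_token_id for t in tokens)
    if num_image_tokens != num_images:
        raise ValueError(
            f"Got {num_images} image(s) but {num_image_tokens} image token(s); "
            "every image needs a corresponding image special token."
        )
```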
Context
This enables Early Fusion models based on @pbontrager's excellent original RFC on multimodal fusion models #1283. Since the RFC, we have already landed the Deep Fusion model components. This PR discusses and implements the EarlyFusionModel component, along with testing and some lint updates.
Early fusion is simply a decoder with one or more extra encoders whose outputs are merged with the token embeddings from the decoder's embedding table. The challenge lies in how we merge the embeddings and pass them into the decoder.
Changelog
- Split `_fusion.py` into `_fusion_layers.py`, `_early_fusion.py`, and `_deep_fusion.py`
Design
There is one design consideration I am seeking feedback on, and that is the EarlyFusionModel's usage of `self.decoder.tok_embeddings`. It accesses the decoder's token embedding table outside of the decoder forward, because we need to merge the image encoder's (and any other modality encoder's) output embeddings with the text embeddings (in this case, just concatenating in the sequence dimension); a rough sketch of this flow is shown below.

Now, instead of token ids, we are passing the merged embeddings directly into the decoder. But since we already used the text-only tok_embeddings from the decoder, we need to skip it when passing in the merged embeddings for the final decoder output. There are two ways we can do this.
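An illustrative sketch of the flow described above, using the masked_scatter merge from the diff hunks earlier in the conversation; the names `decoder`, `encoder`, `encoder_input`, and `image_token_id` are placeholders, not the exact EarlyFusionModel code:

```python
def early_fusion_forward(decoder, encoder, tokens, encoder_input, image_token_id):
    # Embed text tokens with the decoder's own embedding table (outside decoder.forward)
    embeds = decoder.tok_embeddings(tokens)            # [bsz, seq_len, embed_dim]
    # Encode the other modality, e.g. images
    encoder_embeds = encoder(**encoder_input)
    encoder_embeds = encoder_embeds.view(-1, embeds.shape[-1])
    # Place encoder embeddings at the image-token positions
    encoder_mask = (tokens == image_token_id).unsqueeze(-1)
    fused_embeds = embeds.masked_scatter(encoder_mask, encoder_embeds)
    # The decoder now receives embeddings, not token ids, so its own
    # tok_embeddings step must be skipped (the two options below).
    return decoder(fused_embeds)
```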
State dict surgery
In the current code changes, and as suggested by the original RFC, we can manually set `self.decoder.tok_embeddings = nn.Identity()` so that it becomes a no-op when you forward pass with the merged embeddings; state dict hooks then keep the checkpoint keys compatible with the original decoder.
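A minimal sketch of what this surgery could look like, assuming a wrapper module and PyTorch's (private) `_register_state_dict_hook` API as used elsewhere in torchtune's fusion layers; the class and key names are illustrative, not the actual EarlyFusionModel:

```python
import torch.nn as nn

class EarlyFusionWrapper(nn.Module):
    """Illustrative stand-in for EarlyFusionModel, showing only the embedding surgery."""

    def __init__(self, decoder: nn.Module):
        super().__init__()
        self.decoder = decoder
        # Take ownership of the embedding table and no-op the decoder's copy so the
        # decoder can be called directly on merged embeddings.
        self.tok_embeddings = decoder.tok_embeddings
        self.decoder.tok_embeddings = nn.Identity()
        # Remap keys on save so checkpoints still look like a plain decoder checkpoint.
        # (A matching load_state_dict pre-hook would reverse the mapping on load.)
        self._register_state_dict_hook(self._state_dict_hook)

    @staticmethod
    def _state_dict_hook(module, state_dict, prefix, local_metadata):
        weight = state_dict.pop(prefix + "tok_embeddings.weight")
        state_dict[prefix + "decoder.tok_embeddings.weight"] = weight
```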
Additional input_embeds kwarg
We could add a new keyword argument to `TransformerDecoder`'s forward for input embeddings. If this is passed in, we automatically skip the token embeddings (a sketch of this option is below).

This way we don't need any state dict hooks or decoder modifications. However, we are polluting the decoder model forward with more arguments.
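A hypothetical sketch of this option with a toy decoder (not the actual torchtune TransformerDecoder):

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Toy decoder illustrating the optional input_embeds kwarg."""

    def __init__(self, vocab_size: int = 128, embed_dim: int = 16):
        super().__init__()
        self.tok_embeddings = nn.Embedding(vocab_size, embed_dim)
        self.layers = nn.ModuleList([nn.Linear(embed_dim, embed_dim) for _ in range(2)])
        self.norm = nn.LayerNorm(embed_dim)
        self.output = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens=None, *, input_embeds=None):
        # If merged multimodal embeddings are supplied, skip the text-only lookup.
        h = self.tok_embeddings(tokens) if input_embeds is None else input_embeds
        for layer in self.layers:
            h = layer(h)
        return self.output(self.norm(h))

decoder = TinyDecoder()
text_logits = decoder(torch.randint(0, 128, (2, 6)))  # text-only path
fused = torch.randn(2, 6, 16)                          # merged multimodal embeddings
mm_logits = decoder(input_embeds=fused)                # multimodal path
assert text_logits.shape == mm_logits.shape == (2, 6, 128)
```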
Test plan