
The number of tokens is inconsistent in get_image_embeds #29

Open
CharlesGong12 opened this issue Nov 6, 2024 · 0 comments

Hi,

Thanks for your great work!

In `get_image_embeds` of `Adapter`, the token length will be 256 when `image_pil` or `image_tensor` is the input, but 64 when `image_embeds` is given. A similar issue was discussed in #14.

```python
image_embeds = self.visual_encoder(image_tensor)
```

From the line above, `image_embeds.shape[1]` will be 256 when `image_tensor` or `image_pil` is given. This is the case when we directly use `eval_seed_x_detokenizer.py`.
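For concreteness, here is a minimal, self-contained sketch of why this path yields 256 tokens, assuming a ViT-style encoder with 14×14 patches on a 224×224 input (the mock class, patch size, and hidden dim are my assumptions chosen to match the 256 above, not the repo's actual code):

```python
import torch
import torch.nn as nn

# Stand-in for the ViT visual encoder: 14x14 patches on a 224x224 image
# give a 16x16 grid = 256 patch tokens. Sizes here are assumptions.
class MockVisualEncoder(nn.Module):
    def __init__(self, dim=1024, patch_size=14, image_size=224):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, dim, 16, 16)
        return x.flatten(2).transpose(1, 2)  # (B, 256, dim)

visual_encoder = MockVisualEncoder()
image_tensor = torch.randn(1, 3, 224, 224)
image_embeds = visual_encoder(image_tensor)
print(image_embeds.shape)  # torch.Size([1, 256, 1024])
```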

```python
image_embeds = torch.cat([image_embeds, image_embeds_neg], dim=0)
```

However, when `image_embeds` is given, `image_embeds.shape[1]` will be 64 at the line above, because the LLM's IMG tokens are set to 64. This is the case when we use the LLM's output to decode an image.
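A sketch of this second path, where the 64 IMG-token embeddings come straight from the LLM and the negative embeddings are concatenated (the zero negatives stand in for the unconditional branch; the names and the exact construction in the repo are my assumptions):

```python
import torch

# Decoding from LLM output: embeddings already correspond to the 64
# learnable IMG tokens, so shape[1] is 64 before the adapter runs.
num_img_tokens, dim = 64, 1024
image_embeds = torch.randn(1, num_img_tokens, dim)  # LLM output for IMG tokens
image_embeds_neg = torch.zeros_like(image_embeds)   # unconditional branch
image_embeds = torch.cat([image_embeds, image_embeds_neg], dim=0)
print(image_embeds.shape)  # torch.Size([2, 64, 1024])
```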

Indeed, `shape[1]` ends up as 64 in both cases, since `self.encode_image_embeds` calls `self.resampler`, whose `num_queries` is 64.
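To illustrate why both paths converge, here is a minimal Q-Former/Perceiver-style resampler sketch with 64 learnable queries: the output length is always 64 whether 256 or 64 tokens come in. This is only an illustration under my assumptions, not the repo's actual `Resampler`:

```python
import torch
import torch.nn as nn

# 64 learnable queries cross-attend to the input tokens, so the output
# token count is fixed at 64 regardless of the input length.
class MockResampler(nn.Module):
    def __init__(self, dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        q = self.queries.expand(x.size(0), -1, -1)
        out, _ = self.attn(q, x, x)  # queries attend to the input tokens
        return out                   # (B, num_queries, dim)

resampler = MockResampler()
print(resampler(torch.randn(2, 256, 1024)).shape)  # torch.Size([2, 64, 1024])
print(resampler(torch.randn(2, 64, 1024)).shape)   # torch.Size([2, 64, 1024])
```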

Will the difference between the two cases have any influence? And during training, is 64 or 256 used here?

```python
def forward(self, noisy_latents, timesteps, image_embeds, text_embeds, noise, time_ids):
```
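If it helps to verify, one way to check which length the adapter actually receives during training is a temporary forward hook (this assumes `image_embeds` is the third positional or a keyword argument, matching the signature quoted above, and PyTorch >= 2.0 for `with_kwargs`):

```python
# Temporary debugging hook: logs image_embeds.shape[1] on each forward.
def log_image_tokens(module, args, kwargs, output):
    embeds = kwargs.get("image_embeds")
    if embeds is None and len(args) > 2:
        embeds = args[2]
    if embeds is not None:
        print("adapter received", embeds.shape[1], "image tokens")

# handle = adapter.register_forward_hook(log_image_tokens, with_kwargs=True)
# ... run one training step, then: handle.remove()
```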
