
Image resolution at intermediate layers? #252

Open
yousafe007 opened this issue Sep 12, 2023 · 3 comments

@yousafe007

As is well known, images are resized to 224x224 before being fed into DINO. I am currently working with features from the intermediate layers, specifically layer 9. What is the image resolution at that layer (or at any other layer, for that matter)?

@mathildecaron31 Any help would be appreciated. :)

@tcourat

tcourat commented Mar 5, 2024

This is a vision transformer, so the spatial resolution stays the same throughout the whole network; there are no pooling layers like in a CNN. However, each token corresponds to a patch of size 8x8, so the feature map resolution is 28x28.

@yousafe007
Author

> This is a vision transformer, hence the image resolution is the same throught the whole network. There is not pooling layers like in CNN. However, each token corresponds to a patch size 8x8, hence the feature map resolution is 28x28.

Perhaps my question was ill-formulated. I did mean the feature map, as you said. Could you walk me through how you arrived at the number 28?

@tcourat

tcourat commented Mar 5, 2024

The input image has size 224x224, so you divide each dimension by the patch size of 8 to obtain feature maps of size 28x28. If you choose a different patch size, this changes accordingly.

If you look at the embeddings produced by the model for one image, you get a tensor of shape (785, 768). This is because 785 = 1 + 28*28: a CLS token is prepended to the 28*28 = 784 patch tokens of the feature map. 768 is the hidden dimension (at least for the vitb8 model).
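The arithmetic above can be sketched in a few lines of plain Python (the 224/8/768 values are the ones from this thread, i.e. a ViT-B/8 on a 224x224 input):

```python
# Each non-overlapping patch of the input image becomes one token.
image_size = 224
patch_size = 8

grid = image_size // patch_size      # 28 patches per side
num_patch_tokens = grid * grid       # 784 patch tokens
num_tokens = 1 + num_patch_tokens    # 785, counting the prepended CLS token
hidden_dim = 768                     # ViT-B hidden dimension

print(grid, num_tokens, hidden_dim)  # 28 785 768
```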

If you want to obtain the "image-like" feature maps, you can drop the CLS token and reshape the tensor, e.g.:

fmap = fmap[1:, :]              # keep every token except the CLS token
fmap = fmap.reshape(28, 28, 768)  # 784 tokens back onto the 28x28 spatial grid

The snippet above may change slightly if you deal with batched images (add a leading batch dimension), or with a different patch size or hidden dimension depending on the model.
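For the batched case, a hedged sketch is below. It uses a random NumPy array as a stand-in for the model's output tensor (a torch tensor supports the same `reshape`, and `permute` instead of `transpose`); the batch size of 2 is illustrative:

```python
import numpy as np

batch, grid, dim = 2, 28, 768

# Stand-in for the transformer output: (batch, 1 + 28*28, 768)
tokens = np.random.rand(batch, 1 + grid * grid, dim)

fmap = tokens[:, 1:, :]                      # drop the CLS token -> (batch, 784, 768)
fmap = fmap.reshape(batch, grid, grid, dim)  # back to the spatial grid -> (batch, 28, 28, 768)
fmap = fmap.transpose(0, 3, 1, 2)            # channels-first, CNN-style -> (batch, 768, 28, 28)

print(fmap.shape)  # (2, 768, 28, 28)
```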

(Please note that I am not a maintainer of this repo; I am only sharing what I understood of the architecture, as I am currently also digging into DINOv2.)
