Added so the embeddings are actually dropped #9

Open · wants to merge 2 commits into main
Conversation

@MarcusLoppe (Contributor)

At the moment the TextEmbeddingReturner just returns a mask in which, with a probability given by cond_drop_prob, some values are set to False.

This works great if the model you are working with respects the mask. However, x-transformers attention does not: it takes the raw context and feeds it straight into the k and v linear layers.

https://github.com/lucidrains/x-transformers/blob/0c6266ee44ea99a4449cd9201ba55924a6a7eae7/x_transformers/x_transformers.py#L944

kv_input = default(context, x)

q_input = x
k_input = kv_input
v_input = kv_input
r_input = x 

q = self.to_q(q_input)
k = self.to_k(k_input)
v = self.to_v(v_input) if exists(self.to_v) else k
r = self.to_r(r_input) if exists(self.to_r) else None 
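
For illustration, a minimal sketch of the idea behind this change: zero out the dropped text embeddings themselves rather than only flipping mask values, so projections that ignore the mask see no signal. The function name and shapes below are made up for the example:

import torch

def drop_text_embeds(text_embeds, text_mask, cond_drop_prob):
    # text_embeds: (b, n, d) float, text_mask: (b, n) bool
    b = text_embeds.shape[0]
    # per batch element, decide whether to drop its conditioning
    keep = torch.rand(b, device = text_embeds.device) >= cond_drop_prob   # (b,)
    text_mask = text_mask & keep[:, None]                                 # dropped rows become all False
    text_embeds = text_embeds * text_mask[..., None]                      # zero the embeddings too, so the
                                                                          # k / v projections see nothing
    return text_embeds, text_mask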

@lucidrains (Owner)

oh, the text mask is actually sent here and then applied here

@lucidrains (Owner)

yeah, there are some wasteful projections, even for the tokens that are masked out, but that's how most transformers are trained these days

@MarcusLoppe (Contributor, Author)

> oh, the text mask is actually sent here and then applied here

Correct, but the mask isn't applied everywhere :/

> yeah, there are some wasteful projections, even for the tokens that are masked out, but that's how most transformers are trained these days

I've tried using cond_drop_prob but I had very little success until I made the changes in this commit.
I think it's due to these wasteful projections, since they leak information the model shouldn't have seen and poison the output.

Imagine you have a full transformer that takes tokens, uses the mask to hide some of them, and the decoder should reconstruct the full sequence. If some of the data that should be masked leaks into the model, the training is poisoned.

@lucidrains (Owner)

the mask should be applied or that would be a pretty big bug in x transformers

let me check tonight, out with 🐕

@lucidrains (Owner)

oh shoot, maybe the condition dropping isn't being applied to cached text embeddings 🤦‍♂️ will fix if not!

@MarcusLoppe (Contributor, Author)

> the mask should be applied or that would be a pretty big bug in x transformers

> let me check tonight, out with 🐕

It's applied later on in the code, but the k and v values at the start are not masked.
If you call it with context and context_mask, kv_input will default to context, and on the next lines kv_input is passed to the k and v linear layers.
This means they get the full context; I don't see the mask being applied to the context in any earlier step.

https://github.com/lucidrains/x-transformers/blob/0c6266ee44ea99a4449cd9201ba55924a6a7eae7/x_transformers/x_transformers.py#L944

    def forward(
        self,
        x,
        context = None,
        mask = None,
        context_mask = None,
        attn_mask = None,
        rel_pos = None,
        rotary_pos_emb = None,
        prev_attn = None,
        mem = None,
        mem_mask = None,
        return_intermediates = False,
        cache: Intermediates | None = None,
    ):
        b, n, h, kv_h, head_scale, num_mem_kv, device, has_context = x.shape[0], x.shape[1], self.heads, self.kv_heads, self.head_scale, self.num_mem_kv, x.device, exists(context)

        kv_input = default(context, x)

        q_input = x
        k_input = kv_input
        v_input = kv_input
        r_input = x

        if exists(mem):
            k_input, mem_packed_shape = pack([mem, k_input], 'b * d')
            v_input, _ = pack([mem, v_input], 'b * d')

        q = self.to_q(q_input)
        k = self.to_k(k_input)
        v = self.to_v(v_input) if exists(self.to_v) else k
        r = self.to_r(r_input) if exists(self.to_r) else None

@lucidrains (Owner)

later masking is fine, gradients cannot propagate to the point you are concerned about
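
A toy check of that claim (purely illustrative, not the x-transformers code): once the context mask fills the masked logits with -inf, those tokens get exactly zero attention weight, and no gradient flows back to them even though they were projected:

import torch

b, n, d = 1, 4, 8
q = torch.randn(b, 1, d)
context = torch.randn(b, n, d, requires_grad = True)
to_k = torch.nn.Linear(d, d)

k = to_k(context)                                    # projected even for masked tokens ("wasteful")
mask = torch.tensor([[True, True, False, False]])    # last two context tokens are masked out

sim = q @ k.transpose(-1, -2)                        # (b, 1, n) attention logits
sim = sim.masked_fill(~mask[:, None, :], float('-inf'))
attn = sim.softmax(dim = -1)                         # masked positions get exactly zero weight

out = attn @ context                                 # using context directly as values for brevity
out.sum().backward()
print(context.grad[0, 2:].abs().sum())               # tensor(0.) - masked tokens receive no gradient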

@lucidrains (Owner)

https://github.com/lucidrains/meshgpt-pytorch/blob/main/meshgpt_pytorch/meshgpt_pytorch.py#L1362 the issue is because cached text embeds are passed through here and never cond dropped, will fix! thank you Marcus 🙏

@MarcusLoppe (Contributor, Author)

> later masking is fine, gradients cannot propagate to the point you are concerned about

I'm not quite sure if I'm misunderstanding, but doesn't it matter that k and v have been computed from the full context?

k = self.to_k(k_input)
v = self.to_v(v_input)

> https://github.com/lucidrains/meshgpt-pytorch/blob/main/meshgpt_pytorch/meshgpt_pytorch.py#L1362 the issue is because cached text embeds are passed through here and never cond dropped, will fix! thank you Marcus 🙏

I don't think that's my problem, since when cond_drop_prob is None the conditioner uses self.cond_drop_prob, which is 0.0 for me.
However, it might be good to change that just in case.
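
For reference, the fallback being described is roughly this (a paraphrase, not the exact meshgpt-pytorch code):

def default(val, d):
    return val if val is not None else d

# inside the conditioner (sketch): cond_drop_prob = default(cond_drop_prob, self.cond_drop_prob)
# so with self.cond_drop_prob == 0.0 and no explicit override, conditioning is never dropped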

@lucidrains (Owner)

it doesn't matter if they are masked out later

@MarcusLoppe (Contributor, Author)

> it doesn't matter if they are masked out later

You are probably correct; sorry, I thought the linear layer mixed information across the length dimension (see the quick check below).

A little off topic: is there any mode or such that can take an input of shape (b, length, dim) and output another length, e.g. (b, 400, dim) to (b, 900, dim)?
I know it works with autoregression, but I'm not sure if there is any solution that can produce a different length in a single pass.
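
On the linear-layer point: a nn.Linear applied to a (b, length, dim) tensor acts on each position independently, so changing one position can't affect the projection of another. A quick check with toy shapes:

import torch
import torch.nn as nn

proj = nn.Linear(16, 16)
x = torch.randn(2, 400, 16)

y1 = proj(x)
x2 = x.clone()
x2[:, 100:] = 0.0                                  # zero out the tail of the sequence
y2 = proj(x2)

print(torch.allclose(y1[:, :100], y2[:, :100]))    # True - the first 100 positions are unchanged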

@lucidrains (Owner)

@MarcusLoppe just double-checked the code and couldn't find any errors

I think it should be working

@lucidrains (Owner)

you can pad a dimension; in your example, it would be F.pad(tensor, (0, 0, 0, 900 - 400)) if you wanted it on the right-hand side
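
For completeness, the pad tuple is read from the last dimension backwards, so (0, 0, 0, 500) leaves the feature dimension alone and appends 500 zero rows along the length:

import torch
import torch.nn.functional as F

x = torch.randn(2, 400, 512)              # (b, length, dim)
y = F.pad(x, (0, 0, 0, 900 - 400))        # (0, 0) for dim, (0, 500) on the right of length
print(y.shape)                            # torch.Size([2, 900, 512])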

@MarcusLoppe (Contributor, Author) · Jun 9, 2024

> you can pad a dimension; in your example, it would be F.pad(tensor, (0, 0, 0, 900 - 400)) if you wanted it on the right-hand side

Hey, I was just wondering whether it would be possible to modify the autoencoder to encode point clouds and then decode them into a 3D mesh.
But since the number of points and the number of triangles aren't the same, I wanted to know if it's possible to change the output length.

I was thinking of using a point-cloud autoencoder to encode the point cloud, and then modifying the mesh autoencoder so it takes the point-cloud embeddings as input and decodes them into a 3D mesh.
The only solution I can think of is a layer that predicts the length from the encoding and then reshapes the tensor, but that seems like a bad solution. What do you think?
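
One possible sketch of the "predict the length from the encoding, then resize" idea (entirely hypothetical, nothing like this exists in the repo, and it picks a single length per batch for simplicity):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LengthResizer(nn.Module):
    # hypothetical: pool the encoding, predict a target length, then
    # interpolate the sequence to that length in a single pass
    def __init__(self, dim, max_len = 2048):
        super().__init__()
        self.max_len = max_len
        self.to_length = nn.Linear(dim, 1)

    def forward(self, x):                              # x: (b, n, d)
        pooled = x.mean(dim = 1)                       # (b, d)
        frac = self.to_length(pooled).sigmoid().mean() # scalar in (0, 1)
        target_len = max(int(frac.item() * self.max_len), 1)
        x = x.transpose(1, 2)                          # (b, d, n) for interpolate
        x = F.interpolate(x, size = target_len, mode = 'linear', align_corners = False)
        return x.transpose(1, 2)                       # (b, target_len, d)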
