Hey @dome272,
Amazing work on the V2.
Looking at the code, I see that Stage C is not diffusing in the latent space of EffNet, since its shape is Bx16x24x24 and not 16x12x12 as stated in the paper. However, I see that the Stage B uncond shape is still 16x12x12, so I'm a bit confused about what is happening there.
Also, if I understand correctly, Stage B is not Paella-like anymore?
Will there be a V2 of the paper as well with all the changes?
Thanks!
Based on the video, it sounds like Wuerstchen V2 started training with 512x512 images (12x12 latents) and then fine-tuned on 1024x1024 images (24x24 latents) to get the final checkpoint.
This approach is similar to how the SD team did their initial training on 256x256 images (32x32 latents), then fine-tuned on 512x512 images (64x64 latents), and (for SDXL) did further fine-tuning on 1024x1024-area images to get the final checkpoint.
Hey there, @madebyollin is completely right. We pretrained at 3x512x512 -> 16x12x12 and then, after 500k iterations, moved to 3x1024x1024 -> 16x24x24. Some great people are helping us rewrite the paper right now and bring all the updates into an updated v2 paper. But this might still take a bit :c
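For concreteness, here's a minimal sketch of how the EffNet latent shape follows from the training resolution. This isn't from the repo; the helper name is hypothetical, and it just assumes the fixed ~42.67x spatial compression (512 -> 12, 1024 -> 24) and 16 latent channels implied by the numbers above:

```python
def effnet_latent_shape(batch_size: int, image_size: int) -> tuple:
    """Hypothetical helper: EffNet latent shape for a square image,
    assuming the ~42.67x spatial compression described above
    (512 -> 12, 1024 -> 24) and 16 latent channels."""
    latent_size = image_size * 12 // 512
    return (batch_size, 16, latent_size, latent_size)

assert effnet_latent_shape(8, 512) == (8, 16, 12, 12)    # pretraining
assert effnet_latent_shape(8, 1024) == (8, 16, 24, 24)   # fine-tuning
```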
But yeah, Stage B is a diffusion model now as well. We haven't done any comparison; it was just that Pablo was initially frustrated that the LDM Stage B would always crash and made it his goal to get it working really well. After that was achieved, we just went with it. It would still be interesting to make a fair comparison to the Paella architecture for Stage B, though. Another idea would be to discretize the Stage B latents and then learn a Paella model as Stage C. But we haven't done this yet.