For my detailed observations and analysis follow this word document - Dissecting Image Generation.docx
This project focuses on image generation with Stable Diffusion and ControlNet, conditioned on depth maps and Canny edges. The objective is to evaluate these conditioning techniques and identify the configurations that produce the best output images. The project also explores the impact of different aspect ratios and measures generation latency.
- Python 3.9 or later
- PyTorch 1.11.0+
- Transformers (for ControlNet)
- Diffusers
- OpenCV
- Matplotlib
- scikit-image (skimage)
# Install PyTorch and torchvision (for GPU version, make sure CUDA is installed)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install Diffusers for Stable Diffusion and ControlNet
pip install diffusers transformers accelerate
pip install timm
# Install PIL (Pillow) for image processing
pip install pillow
# Install OpenCV for Canny edge detection
pip install opencv-python
# Install Matplotlib (optional, for plotting images)
pip install matplotlib
- ControlNet Model:
lllyasviel/control_v11f1p_sd15_depth
- Stable Diffusion Checkpoint:
runwayml/stable-diffusion-v1-5
For this task, I used the provided depth maps and applied various conditioning techniques such as Canny edges to enhance the output. I experimented with different configurations to generate the "best" possible images.
- The number of inference steps significantly impacts both the quality and time taken to generate images.
- 25 steps: Provided faster results, but the image quality was lower than with 50 or 100 steps.
- 50 steps: Achieved a balance between speed and image quality.
- 100 steps: Produced the most detailed images but required much longer generation times.
- Images generated at 25, 50, and 100 steps, for each of the following prompts:
  - "beautiful landscape, mountains in the background."
  - "luxurious bedroom interior."
  - "room with chair."
  - "house in the forest."
In this task, I explored the impact of aspect ratio on image quality by generating images in 1:1 and 4:3 aspect ratios.
- The depth map image nocrop.png was resized to both 1:1 and 4:3 aspect ratios.
- I also cropped the original image to these aspect ratios to compare the visual differences between resizing and cropping.
- 1:1 Aspect Ratio: Maintains a balanced composition, but resizing may lead to distortion in some regions.
- 4:3 Aspect Ratio: Provides a wider field of view but introduces some stretching when resized. Cropping yielded better results for preserving the visual quality.
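The resize-versus-crop comparison can be reproduced with Pillow. A minimal sketch; a blank synthetic image stands in for nocrop.png so the snippet is self-contained, and the target sizes are illustrative:

```python
from PIL import Image

def resize_to(img: Image.Image, w: int, h: int) -> Image.Image:
    """Rescale to an exact size; may distort if the aspect ratio changes."""
    return img.resize((w, h), Image.LANCZOS)

def center_crop_to(img: Image.Image, ratio_w: int, ratio_h: int) -> Image.Image:
    """Crop the largest centered window with the target aspect ratio (no distortion)."""
    w, h = img.size
    target = ratio_w / ratio_h
    if w / h > target:                       # too wide: trim width
        new_w, new_h = int(h * target), h
    else:                                    # too tall: trim height
        new_w, new_h = w, int(w / target)
    left, top = (w - new_w) // 2, (h - new_h) // 2
    return img.crop((left, top, left + new_w, top + new_h))

img = Image.new("L", (800, 600))             # stand-in for the nocrop.png depth map
square = center_crop_to(img, 1, 1)           # 1:1 by cropping
resized = resize_to(img, 512, 512)           # 1:1 by resizing (distorts a 4:3 source)
```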
This task evaluates the time taken to generate images and explores ways to reduce latency.
- 25 steps: Faster but lower-quality images.
- 50 steps: Provides a balance between speed and quality.
- 100 steps: Best image quality, but the generation time is significantly longer.
- Model Quantization: By converting the model to INT8 precision, we can speed up inference without significantly compromising image quality.
- Scheduler Tuning: We experimented with different schedulers (DDIM, LMS, Euler) to reduce inference time.
- Low-Resolution Images: Reducing image resolution (e.g., 256x256) can decrease the overall generation time.
Image generation (25 steps) took 5.42 seconds.
Image generation (50 steps) took 10.82 seconds.
Image generation (100 steps) took 20.57 seconds.
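Timings like the ones above can be collected with a simple wall-clock wrapper. A minimal sketch; the `time.sleep` call is a placeholder standing in for the actual pipeline call:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and report its wall-clock latency in seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"Image generation took {elapsed:.2f} seconds.")
    return result, elapsed

# Usage (placeholder standing in for pipe(prompt, image=depth_map, ...)):
_, secs = timed(time.sleep, 0.05)
```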
- Depth Map vs Depth Map + Canny Edges:
- The combination of depth maps and Canny edges provides sharper, more detailed images compared to using depth maps alone.
- Inference Steps:
- Higher inference steps (50 or 100) provide better quality, but with a significant increase in generation time.
- Aspect Ratio Differences:
- 1:1 vs 4:3 aspect ratios produced different compositions. The 1:1 aspect ratio provided a more balanced image, while 4:3 gave a broader view.
- Resized vs Cropped:
- Cropped images maintained visual quality better than resized images.
- Latency Optimization:
- Reducing inference steps and lowering image resolution both cut generation time, with only a slight impact on image quality.
The project demonstrates how depth maps and Canny edges can be effectively used to guide image generation with Stable Diffusion and ControlNet. Higher inference steps produce better quality images, but they also significantly increase the generation time. Using techniques like INT8 quantization and reducing image resolution can optimize the image generation process while still maintaining acceptable quality. Resizing and cropping images to different aspect ratios also provided interesting insights on composition and quality.