EscherNet: A Generative Model for Scalable View Synthesis

Xin Kong · Shikun Liu · Xiaoyang Lyu · Marwan Taher · Xiaojuan Qi · Andrew J. Davison

Paper | Project Page

EscherNet is a multi-view conditioned diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with the camera positional encoding (CaPE), allowing precise and continuous relative control of the camera transformation between an arbitrary number of reference and target views.

Install

conda env create -f environment.yml -n eschernet
conda activate eschernet

Demo

Run the demo to generate 25 randomly sampled novel views from 1, 2, 3, 5, or 10 reference views:

bash eval_eschernet.sh

Camera Positional Encoding (CaPE)

CaPE is applied in the self- and cross-attention layers to encode camera pose information into the transformer. The main modification is in diffusers/models/attention_processor.py.

To quickly check the implementation of CaPE (6DoF and 4DoF), run:

python CaPE.py
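To illustrate why a pose encoding placed inside attention can depend only on the relative camera transformation, here is a minimal, dependency-free sketch. It assumes a 6DoF-style scheme in which each 4-dim chunk of a key is multiplied by its 4x4 camera pose and each chunk of a query by the inverse transpose of its pose. It is an illustration under that assumption, not the code in CaPE.py, and every function name below is hypothetical.

```python
import math

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def make_pose(yaw, pitch, t):
    """4x4 rigid transform with rotation Rz(yaw) @ Rx(pitch) and translation t."""
    cz, sz = math.cos(yaw), math.sin(yaw)
    cx, sx = math.cos(pitch), math.sin(pitch)
    rz = [[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]]
    rx = [[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]]
    r = matmul(rz, rx)
    return [r[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def invert_pose(p):
    """Closed-form inverse of a rigid transform: [R^T, -R^T t]."""
    rt = transpose([row[:3] for row in p[:3]])
    t = [p[i][3] for i in range(3)]
    new_t = [-sum(rt[i][j] * t[j] for j in range(3)) for i in range(3)]
    return [rt[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def encode(vec, mat):
    """Apply a 4x4 matrix to each consecutive 4-dim chunk of a feature vector."""
    out = []
    for start in range(0, len(vec), 4):
        chunk = vec[start:start + 4]
        out.extend(sum(mat[i][j] * chunk[j] for j in range(4)) for i in range(4))
    return out

def attention_logit(q, k, pose_q, pose_k):
    """Dot product of a pose-encoded query and key (assumed scheme, see lead-in)."""
    q_enc = encode(q, transpose(invert_pose(pose_q)))  # P_q^{-T} q
    k_enc = encode(k, pose_k)                          # P_k k
    return sum(a * b for a, b in zip(q_enc, k_enc))

q = [0.3, -1.2, 0.5, 0.8, 1.1, 0.0, -0.4, 0.9]
k = [0.7, 0.2, -0.6, 1.0, -0.9, 0.4, 0.3, -0.1]
pose_q = make_pose(0.4, -0.2, [0.1, 0.5, 2.0])
pose_k = make_pose(-1.0, 0.7, [1.0, -0.3, 1.5])
world = make_pose(2.1, 0.3, [-4.0, 2.0, 0.5])  # arbitrary change of world frame

logit = attention_logit(q, k, pose_q, pose_k)
logit_moved = attention_logit(q, k, matmul(world, pose_q), matmul(world, pose_k))
print(abs(logit - logit_moved) < 1e-9)
```

Under this assumption the logit equals q^T P_q^{-1} P_k k per chunk, so moving both cameras by the same world transform leaves it unchanged. Only the relative pose between the two views matters.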

Training

Objaverse 1.0 Dataset

Download Zero123's Objaverse Rendering data:

wget https://tri-ml-public.s3.amazonaws.com/datasets/views_release.tar.gz

Filter out empty images from the Zero-1-to-3 rendered views:

cd scripts
python objaverse_filter.py --path /data/objaverse/views_release

Launch training

Configure accelerator (8 A100 GPUs, bf16):

accelerate config

Choose 4DoF or 6DoF CaPE (Camera Positional Encoding):

cd 4DoF  # or: cd 6DoF

Launch training:

accelerate launch train_eschernet.py \
  --train_data_dir /data/objaverse/views_release \
  --pretrained_model_name_or_path runwayml/stable-diffusion-v1-5 \
  --train_batch_size 256 \
  --dataloader_num_workers 16 \
  --mixed_precision bf16 \
  --gradient_checkpointing \
  --T_in 3 --T_out 3 --T_in_val 10 \
  --output_dir logs_N3M3B256_SD1.5 \
  --push_to_hub --hub_model_id ***** --hub_token hf_******************* \
  --tracker_project_name eschernet
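If you launch several runs with different settings, it can help to assemble the command from a dictionary rather than editing one long line. The sketch below is a hypothetical helper, not part of this repository. It reuses only the flags shown in the command above and leaves out the Hub options, whose values are elided here.

```python
import shlex

def build_launch_command(script, options):
    """Turn an options dict into an `accelerate launch` argument list.

    True becomes a bare flag, False/None are skipped, other values are
    passed as `--name value`.
    """
    cmd = ["accelerate", "launch", script]
    for name, value in options.items():
        if value is True:
            cmd.append(f"--{name}")
        elif value is not False and value is not None:
            cmd.extend([f"--{name}", str(value)])
    return cmd

options = {
    "train_data_dir": "/data/objaverse/views_release",
    "pretrained_model_name_or_path": "runwayml/stable-diffusion-v1-5",
    "train_batch_size": 256,
    "dataloader_num_workers": 16,
    "mixed_precision": "bf16",
    "gradient_checkpointing": True,
    "T_in": 3,
    "T_out": 3,
    "T_in_val": 10,
    "output_dir": "logs_N3M3B256_SD1.5",
    "tracker_project_name": "eschernet",
}

command = build_launch_command("train_eschernet.py", options)
print(shlex.join(command))  # pass `command` to subprocess.run(...) to launch
```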

To monitor training progress, we recommend wandb for its simplicity and powerful features.

wandb login

Offline mode:

WANDB_MODE=offline python xxx.py
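The same environment variable can also be set from Python when a script is started as a child process. This is a minimal sketch with a hypothetical helper name. The child command here only echoes the variable so that the example runs anywhere; in practice you would replace it with your own training command.

```python
import os
import subprocess
import sys

def run_offline(argv):
    """Run a command with WANDB_MODE=offline added to the current environment."""
    env = dict(os.environ, WANDB_MODE="offline")
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)

# Stand-in child process that just reports the variable it received.
result = run_offline([sys.executable, "-c", "import os; print(os.environ['WANDB_MODE'])"])
print(result.stdout.strip())
```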

Evaluation

We provide raw results and two checkpoints (4DoF and 6DoF) for easier comparison.

Datasets

GSO (Google Scanned Objects)

GSO30: We select 30 objects from the GSO dataset and render 25 randomly sampled novel views of each object for both NVS and 3D reconstruction evaluation.

RTMV

We use the 10 scenes from google_scanned.tar under the 40_scenes folder for NVS evaluation.

NeRF_Synthetic

We use all 8 NeRF objects for 2D NVS evaluation.

Franka16

We collected 16 real-world object-centric recordings using a Franka Emika Panda robot arm with a RealSense D435i camera for real-world NVS evaluation.

Text2Img

We collected Text2Img generation results from the internet, Stable Diffusion XL (1 view), and MVDream (4 views: front, right, back, left) for NVS evaluation.

Novel View Synthesis (NVS)

To get 2D Novel View Synthesis (NVS) results, set cape_type, checkpoint, data_type, and data_dir, then run:

bash ./eval_eschernet.sh

Evaluate 2D metrics (PSNR, SSIM, LPIPS):

cd metrics
python eval_2D_NVS.py
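For reference, PSNR is the simplest of the three metrics: it is 10 * log10(MAX^2 / MSE), where MAX is the largest possible pixel value. The sketch below shows that formula on flat lists of pixel values in [0, 1]. It is a hypothetical helper for illustration, not the code in eval_2D_NVS.py, and it does not cover SSIM or LPIPS.

```python
import math

def psnr(pred, target, max_value=1.0):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and equally sized")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

pred = [0.0, 0.5, 1.0, 0.5]
target = [0.1, 0.5, 0.9, 0.5]
print(f"{psnr(pred, target):.2f} dB")  # MSE = 0.005, so about 23.01 dB
```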

3D Reconstruction

We first generate 36 novel views with data_type=GSO3D:

bash ./eval_eschernet.sh

Then we adopt NeuS for 3D reconstruction:

export CUDA_HOME=/usr/local/cuda-11.8
pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
cd 3drecon
python run_NeuS.py

Evaluate 3D metrics (Chamfer Distance, IoU):

cd metrics
python eval_3D_GSO.py
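As a rough guide to what the two 3D metrics measure, here is a brute-force sketch. It assumes one common convention: Chamfer Distance as the sum of the mean nearest-neighbour Euclidean distances in both directions, and IoU computed over sets of occupied voxel indices. Conventions differ (for example squared distances, or averaging the two directions), and eval_3D_GSO.py may use a different one. The function names are hypothetical.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets (brute force, O(N*M))."""
    def one_way(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)

def volume_iou(voxels_a, voxels_b):
    """Intersection over union of two sets of occupied voxel indices."""
    a, b = set(voxels_a), set(voxels_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

shape_a = [(0, 0, 0), (1, 0, 0)]
shape_b = [(0, 0, 0), (1, 0, 1)]
print(chamfer_distance(shape_a, shape_b))  # 0.5 + 0.5 = 1.0
print(volume_iou(shape_a, shape_b))        # 1 shared voxel out of 3
```

Lower Chamfer Distance and higher IoU both indicate a closer match between the reconstructed and ground-truth shapes.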

Gradio Demo

TODO.

To build locally:

python gradio_eschernet.py

Acknowledgement

We have borrowed code extensively from the following repositories. Many thanks to the authors for sharing their code.

Citation

If you find this work useful, please consider citing:

@article{kong2024eschernet,
  title={EscherNet: A Generative Model for Scalable View Synthesis},
  author={Kong, Xin and Liu, Shikun and Lyu, Xiaoyang and Taher, Marwan and Qi, Xiaojuan and Davison, Andrew J},
  journal={arXiv preprint arXiv:2402.03908},
  year={2024}
}
