- Open-Vocabulary 3D Localization. Locate anything with natural language dialog!
- Interactive Grounding. Humans will be able to chat with an agent to localize novel objects.
- [2023-09-21] We released a paper LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent that includes more quantitative evaluations of our pipeline!
- [2023-05-31] We improved the demo by adding grounding result visualization in 3D, taking pictures in real time, and speeding up inference by parallelization. Try out the new demo!
- [2023-05-15] The first version of chat-with-nerf is available now! Please try out demo!
- A faster process to determine camera poses and rendering pictures. See discussion #15. Implemented in #17.
- Use LLaVA to replace BLIP-2 for better image captioning.
- Improve the foundation model (currently CLIP is used) used in LERF for grounding, which can potentially improve spatial and affordance understanding. Potential candidate: LLaVA, BLIP-2, OWL-ViT.
To install the dependencies we provide a Dockerfile:
docker build -t chat-with-nerf:latest .
Or if you want to pull remote image from Dockerhub to save significant time, please try:
docker pull jedyang97/chat-with-nerf:latest
Otherwise, if you prefer build it locally:
conda create --name nerfstudio -y python=3.8
conda activate nerfstudio
pip install torch==1.13.1 torchvision functorch --extra-index-url https://download.pytorch.org/whl/cu117
pip install ninja git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
pip install nerfstudio
git clone https://github.com/kerrj/lerf
python -m pip install -e .
ns-train -h
Note that specific CUDA 11.3 is required. For further information, please check nerfstudio installation guide.
Then locally you need to run
git clone https://github.com/sled-group/chat-with-nerf.git
Download and construct the llava-13b-v0 checkpoint (see LLaVA's documentation on how to construct the checkpoint). Then assuming you store the constructed llava-13b-v0
checkpoint under <my_path_to_llava>/llava-13b-v0
, move the checkpoint to /chat-with-nerf/pre-trained-weights/LLaVA
.
cd chat-with-nerf
mkdir -p pre-trained-weights/LLaVA
cd pre-trained-weights/LLaVA
mv <my_path_to_llava>/llava-13b-v0 .
Alternatively, you can supply a different version of LLaVA checkpoint and change LLAVA_PATH
's value in chat_with_nerf/settings.py
:
LLAVA_PATH = "/workspace/pre-trained-weights/LLaVA/<my_llava_checkpoint>"
Open up your directory's permission for the docker container:
cd <parent_path_chat-with-nerf>
chmod -R 777 .
If using Docker, you can use the following command to spin up a docker container with chat-with-nerf mounted under workspace
docker run --gpus "device=0" -v /<parent_path_chat-with-nerf>/:/workspace/ -v /home/<your_username>/.cache/:/home/user/.cache/ --rm -it --shm-size=12gb chat-with-nerf:latest
Then install Chat with NeRF dependencies
cd /workspace/chat-with-nerf
pip install -e .
pip install -e .[dev]
(or use your favorite virtual environment manager)
We provide the code to interactively play with our agent. To run the demo:
cd /workspace/chat-with-nerf
export $(cat .env | xargs); gradio chat_with_nerf/app.py
We provide four Jupyter notebooks to reproduce results in the paper. To run these notebooks, please refer to the Evaluation README.
To facillate easier reproduction of our results, we provide pre-processed data here.
If you would like to use your own 3D scenes, please follow the next two sections:
For extracting the openscene embeddings, we used the pre-trained Distillation model checkpoint, shared by the Openscene Authors for generating the representation. To generate the corresponding representations, kindly refer to the guidelines provided in the Openscene GitHub repository, specifically focusing on the Data Preparation and Run Sections.
https://github.com/pengsongyou/openscene#data-preparation
https://github.com/pengsongyou/openscene#run
We include a version of NeRFStudio code in our released docker and you can use generate point cloud function to acquire the H5 embedding. We slightly altered the ns-export function: https://docs.nerf.studio/reference/cli/ns_export.html to get the H5 embeddings.
@misc{yang2023llmgrounder,
title={LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent},
author={Jianing Yang and Xuweiyi Chen and Shengyi Qian and Nikhil Madaan and Madhavan Iyengar and David F. Fouhey and Joyce Chai},
year={2023},
eprint={2309.12311},
archivePrefix={arXiv},
primaryClass={cs.CV}
}