Spatial Audio Generation [Project Page]
This repository contains the source code accompanying our NIPS'18 paper:
Self-Supervised Generation of Spatial Audio for 360 Video
Pedro Morgado, Nuno Vasconcelos, Timothy Langlois, Oliver Wang.
In Neural Information Processing Systems, 2018.
[Arxiv]
@inproceedings{morgadoNIPS18,
title={Self-Supervised Generation of Spatial Audio for 360\deg Video},
author={Pedro Morgado and Nuno Vasconcelos and Timothy Langlois and Oliver Wang},
booktitle={Neural Information Processing Systems (NIPS)},
year={2018}
}
- python 2.7 (see requirements.txt for required packages)
- tensorflow 1.4
- youtube-dl
- ffmpeg (see my ffmpeg configuration)
- FlowNet2 (caffe)
Four datasets are described in the paper: REC-Street, YT-Clean, YT-Music and YT-All.
Since we do not own the copyright of the 360 videos scraped from YouTube, all datasets are released as a list of YouTube links (meta/spatialaudiogen_db.lst).
The composition of each dataset can be seen in meta/subsets/{DB}.lst. Three train/test splits are also provided for each dataset (meta/subsets/{DB}.{TRAIN|TEST}.{SPLIT}.lst).
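The naming pattern above can be expanded programmatically. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and it assumes that the phase is written in lowercase and that splits are numbered 1 to 3, as suggested by the file name REC-Street.train.1.lst used in the training command in this README.

```python
import itertools

# Dataset names as listed in the paper.
DATASETS = ["REC-Street", "YT-Clean", "YT-Music", "YT-All"]
# Assumption: phase names are lowercase on disk, as in REC-Street.train.1.lst.
PHASES = ["train", "test"]
# Assumption: the three splits are numbered 1, 2 and 3.
SPLITS = [1, 2, 3]


def subset_path(db, phase, split, root="meta/subsets"):
    """Build the list-file path for one dataset, phase and split."""
    return "{}/{}.{}.{}.lst".format(root, db, phase, split)


def all_subset_paths(root="meta/subsets"):
    """Enumerate every train/test split file for every dataset."""
    return [subset_path(db, phase, split, root)
            for db, phase, split in itertools.product(DATASETS, PHASES, SPLITS)]


if __name__ == "__main__":
    for path in all_subset_paths():
        print(path)
```

Checking these paths against the files actually present in meta/subsets is a quick way to confirm the assumed naming before launching a long training run.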
>> python scrapping/download.py meta/spatialaudiogen_db.lst
Run python scrapping/download.py -h for help.
This script uses youtube-dl to download pre-selected audio and video formats for which the encoding scheme has been verified.
Videos are downloaded into the data/orig directory.
Unfortunately, a small number of videos have been removed by the creators and will be skipped (36 out of 1189 at the time of writing).
(Low-resolution) Training uses low-resolution videos. If you only want to download the low-resolution versions, use the flag --low_res. However, you still need the full-resolution videos to deploy trained models, in order to create a good-looking 360 video with spatial audio.
>> python scrapping/preprocess.py meta/spatialaudiogen_db.lst
Run python scrapping/preprocess.py -h for help.
This script pre-processes previously downloaded videos.
Video frames are resized to (224x448) and remapped into equirectangular projection at 10 fps. Audio channels are remapped into ACN format (WYZX) and resampled at 48000 Hz.
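As an illustration of the channel remapping mentioned above, here is a minimal sketch of reordering first-order ambisonic channels into ACN order (W, Y, Z, X). It is not the code used by preprocess.py: the function name is hypothetical, the W, X, Y, Z input layout is only an assumed example, and the sketch changes channel order without applying any gain normalization.

```python
# Target order named in this README: ACN, i.e. W, Y, Z, X.
ACN_ORDER = ("W", "Y", "Z", "X")


def reorder_to_acn(frames, source_order=("W", "X", "Y", "Z")):
    """Reorder first-order ambisonic channels into ACN order.

    frames: list of per-sample tuples, one value per channel.
    source_order: assumed (hypothetical) channel layout of the input.
    Only the order changes; no gain normalization is applied.
    """
    index = [source_order.index(name) for name in ACN_ORDER]
    return [tuple(frame[i] for i in index) for frame in frames]


if __name__ == "__main__":
    # Two samples given in W, X, Y, Z order.
    samples = [(1.0, 0.2, 0.3, 0.4), (0.9, -0.2, -0.3, -0.4)]
    print(reorder_to_acn(samples))
```

Passing source_order=("W", "Y", "Z", "X") leaves the samples unchanged, which is a simple sanity check for audio that is already in ACN order.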
Preprocessed files are stored in data/preproc (as .m4a and .mp4) and data/frames (as .jpg and .wav).
Training, evaluation and deployment code use the data in data/frames.
IMPORTANT: (Low-resolution downloads) If you opted to download only low-resolution videos in the previous step, use the flag --low_res again.
Preparing high-resolution videos for deployment: Assuming you downloaded high-resolution videos (i.e. without the --low_res flag), you can preprocess them in high resolution (1920x1080) using the flag --prep_hr_video. This is not required for training or evaluation, but it is recommended for deployment.
Flow: To extract flow maps, you'll first need to install FlowNet2. Refer to the FlowNet2 documentation for instructions. Then, provide the path to the FlowNet2 base folder through --flownet2_dir. If no path is provided, flow computation is skipped.
Note: Downloading and preprocessing the entire dataset will take a long time, and requires more than 500 GB to store the full-resolution videos (originals and preprocessed). Plan accordingly!
Models pre-trained on each dataset can be downloaded from my OneDrive:
| REC-Street | YT-Clean | YT-Music | YT-All |
After downloading the .tar.gz files, extract them into the models/ directory.
To test the models without downloading the entire dataset, we provide sample pre-processed videos (link).
Download and extract the demo data into data/demo. Then, run a pre-trained model using one of the following options.
[Heatmap Visualization] Colormap overlay with darker red indicating directions with higher audio energy.
>> python deploy.py {MODEL_DIR} data/demo/{VIDEO_DIR}/ data/demo/{VIDEO_DIR}/video-hr.mp4 -output_fn data/demo/{VIDEO_DIR}/prediction-colormap.mp4 --save_video --overlay_map
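To give an intuition for what a directional energy map of first-order ambisonics looks like, the sketch below aims a virtual microphone at each direction on a coarse grid and measures the mean squared output. This is one common way to estimate audio energy per direction and is not necessarily how deploy.py computes its overlay. The function names, the axis convention (x front, y left, z up) and the SN3D-style gains are all assumptions made for this example.

```python
import math


def direction_vector(azimuth, elevation):
    """Unit vector (x, y, z) for a direction; angles in radians.

    Assumed convention: x points front, y left, z up.
    """
    return (math.cos(azimuth) * math.cos(elevation),
            math.sin(azimuth) * math.cos(elevation),
            math.sin(elevation))


def directional_energy(frames, azimuth, elevation):
    """Mean squared output of a virtual microphone aimed at one direction.

    frames: list of (W, Y, Z, X) samples, i.e. ACN order.
    Assumption: SN3D-style gains, so the beam is W + X*x + Y*y + Z*z.
    """
    x, y, z = direction_vector(azimuth, elevation)
    total = 0.0
    for w, cy, cz, cx in frames:
        s = w + cx * x + cy * y + cz * z
        total += s * s
    return total / len(frames)


def energy_map(frames, n_azimuth=8, n_elevation=5):
    """Energy on a coarse grid (rows: elevation, columns: azimuth)."""
    grid = []
    for i in range(n_elevation):
        elevation = -math.pi / 2 + math.pi * i / (n_elevation - 1)
        row = []
        for j in range(n_azimuth):
            azimuth = -math.pi + 2 * math.pi * j / n_azimuth
            row.append(directional_energy(frames, azimuth, elevation))
        grid.append(row)
    return grid


def encode_source(signal, azimuth, elevation):
    """Encode a mono signal as first-order ambisonics in (W, Y, Z, X) order."""
    x, y, z = direction_vector(azimuth, elevation)
    return [(s, s * y, s * z, s * x) for s in signal]


if __name__ == "__main__":
    # Synthetic 440 Hz tone placed to the left of the listener.
    tone = [math.sin(2 * math.pi * 440 * n / 48000.0) for n in range(480)]
    frames = encode_source(tone, math.pi / 2, 0.0)
    for row in energy_map(frames):
        print(" ".join("{:.3f}".format(v) for v in row))
```

With a single synthetic source, the grid peaks at the cell closest to the source direction and falls to roughly zero in the opposite direction, which is the pattern a heatmap overlay makes visible.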
[Ambisonics] Saved with actual spatial sound. The output must be watched with headphones using a 360 video player. See below for more information (section Visualizing predictions).
>> python deploy.py {MODEL_DIR} data/demo/{VIDEO_DIR}/ data/demo/{VIDEO_DIR}.mp4 -output_fn data/demo/{VIDEO_DIR}-output.mp4 --save_video --VR
Run python train.py -h and python eval.py -h for more info.
Example usage: Training and testing a model with audio and RGB video encoders (no flow) on the REC-Street dataset:
>> python train.py data/frames models/mymodel --subset_fn meta/subsets/REC-Street.train.1.lst --encoders audio video --batch_size 32 --n_iters 150000 --gpu 0
>> python eval.py models/mymodel --subset_fn meta/subsets/REC-Street.test.1.lst --batch_size 32 --gpu 0
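Since three train/test splits are provided per dataset, it can be convenient to generate the commands above for every split. The sketch below builds argument lists using only the flags shown in this README. The helper names and the model directory names are hypothetical, and the split numbering (1 to 3) is an assumption based on the example file names.

```python
def train_command(db, split, model_dir, data_dir="data/frames",
                  encoders=("audio", "video"), batch_size=32,
                  n_iters=150000, gpu=0):
    """Argument list for train.py, using only flags shown in this README."""
    subset = "meta/subsets/{}.train.{}.lst".format(db, split)
    return (["python", "train.py", data_dir, model_dir,
             "--subset_fn", subset, "--encoders"] + list(encoders) +
            ["--batch_size", str(batch_size), "--n_iters", str(n_iters),
             "--gpu", str(gpu)])


def eval_command(db, split, model_dir, batch_size=32, gpu=0):
    """Argument list for eval.py on the matching test split."""
    subset = "meta/subsets/{}.test.{}.lst".format(db, split)
    return ["python", "eval.py", model_dir, "--subset_fn", subset,
            "--batch_size", str(batch_size), "--gpu", str(gpu)]


if __name__ == "__main__":
    # Assumption: splits are numbered 1, 2 and 3.
    for split in (1, 2, 3):
        # Hypothetical model directory name, one per split.
        model_dir = "models/rec-street-split{}".format(split)
        print(" ".join(train_command("REC-Street", split, model_dir)))
        print(" ".join(eval_command("REC-Street", split, model_dir)))
```

The argument lists can be passed directly to subprocess.call if you prefer to launch the runs from Python rather than copy the printed commands into a shell.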
To view the output videos of deploy.py, we recommend one of the following players:
This is a web-based JavaScript VR video player. To play a video, click "Choose File" under the "Load local video" option (ignore the local audio file box). The only caveat is that you have to press "stop" before loading the video file; otherwise the previous video continues to play. If you hear audio crackling, refresh the page. If the audio does not start with the video, the browser might be blocking content: press "Load unsafe scripts" in your browser and reload the video.
The GoPro VR Player (Windows and Mac) supports ambisonics audio. You can install the player and load the video files.