Code for running the experiments presented in:
Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings
Jason Cramer, Ho-Hsiang Wu, Justin Salamon and Juan Pablo Bello
Under review, 2018.
For the pre-trained embedding models (OpenL3), please go to: github.com/marl/openl3
This repository contains an implementation of the model proposed in *Look, Listen and Learn* (Arandjelović, R., Zisserman, A. 2017). The model uses videos to learn visual and audio features in an unsupervised fashion by training on the proposed Audio-Visual Correspondence (AVC) task: determining whether a piece of audio and an image frame come from the same video and occur simultaneously.
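The AVC training data can be illustrated with a minimal pair-sampling sketch in plain Python. Note that `sample_avc_pair` and the list-of-(frame, audio) data layout are illustrative assumptions for exposition, not this repository's actual API:

```python
import random

def sample_avc_pair(videos, positive, rng=random):
    """Sample one (frame, audio, label) example for the AVC task.

    `videos` is a list of videos, each a list of (frame, audio) tuples
    aligned in time.  A positive example pairs a frame with the audio
    from the same video at the same time (label 1); a negative example
    pairs it with audio from a different video (label 0).
    """
    vid = rng.randrange(len(videos))
    t = rng.randrange(len(videos[vid]))
    frame, audio = videos[vid][t]
    if positive:
        return frame, audio, 1
    # Negative: replace the audio with a clip from a different video.
    other = rng.choice([i for i in range(len(videos)) if i != vid])
    _, neg_audio = rng.choice(videos[other])
    return frame, neg_audio, 0
```

A network trained to predict the label from such pairs never needs manual annotation, which is what makes the AVC task self-supervised.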
Dependencies
- Python 3 (we use 3.6.3)
- ffmpeg
- sox
- TensorFlow (follow instructions carefully, and install before other Python dependencies)
- keras (follow instructions carefully!)
- Other Python dependencies can be installed via:

```
pip install -r requirements.txt
```
The code for the model and training implementation can be found in `l3embedding/`. Note that the expected metadata format is the same as that used in AudioSet (Gemmeke, J., Ellis, D., et al. 2017), as training this model on AudioSet was one of the goals of this implementation.
You can train an AVC/embedding model using `train.py`. Run `python train.py -h` to see the help message describing how to use the script.
There is also a module `classifier/`, which contains code to train a classifier on embeddings extracted from new audio using the embedding model. Currently this only supports the UrbanSound8K dataset (Salamon, J., Jacoby, C., Bello, J. 2014).
You can train an urban sound classification model using `train_classifier.py`. Run `python train_classifier.py -h` to see the help message describing how to use the script.
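Conceptually, the downstream step embeds each audio clip into a fixed-length vector and fits a shallow classifier on those vectors. As a toy illustration of that second stage only, here is a nearest-centroid classifier in plain Python (this is not the classifier used in this repository; the embedding vectors are assumed to come from the embedding model):

```python
from collections import defaultdict

def train_centroids(embeddings, labels):
    """Average the embedding vectors of each class (nearest-centroid model)."""
    sums, counts = {}, defaultdict(int)
    for x, y in zip(embeddings, labels):
        if y not in sums:
            sums[y] = list(x)
        else:
            sums[y] = [a + b for a, b in zip(sums[y], x)]
        counts[y] += 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)))
```

In practice the embeddings are high-dimensional and the classifier is a trained model rather than class averages, but the pipeline shape (embed, then classify) is the same.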
Download the pre-trained VGGish model files with:

```
cd ./resources/vggish
curl -O https://storage.googleapis.com/audioset/vggish_model.ckpt
curl -O https://storage.googleapis.com/audioset/vggish_pca_params.npz
cd ../..
```
If you use a SLURM environment, `sbatch` scripts are available in `jobs/`.
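For orientation, a typical `sbatch` script for this kind of training job looks something like the sketch below. This is a generic config fragment, not a file from `jobs/`: the job name, resource values, environment name, and output path are all placeholder assumptions to adapt to your cluster.

```
#!/usr/bin/env bash
#SBATCH --job-name=l3embedding-train
#SBATCH --gres=gpu:1
#SBATCH --mem=32GB
#SBATCH --time=48:00:00
#SBATCH --output=train_%j.out

module purge
# Activate the Python environment where the dependencies are installed.
source activate l3embedding

python train.py "$@"
```

Submit it with `sbatch <script> <train.py arguments>`; any arguments after the script name are forwarded to `train.py` via `"$@"`.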