PLEASE NOTE: This project has now transformed into the work being done here. Please head over there for current status.
This is the main repository for the *Implementing and extending unsupervised human-human text/audio translation* project.
How does this fit into the ESP roadmap towards translating animal communication? Unsupervised audio-to-audio translation requires learning how to create useful semantic embeddings directly from audio, which allows for correlation with other behavioral models or comparison across species.
Goal: achieve unsupervised audio-to-audio translation
1. Build text embeddings and demonstrate translation without a rosetta stone
   - Good opportunity to test and demonstrate embedding alignment, the technique we will want to leverage once we obtain audio embeddings (a minimal sketch of this step appears after this list)
   - Findings can help clarify our approach and make it easier to share the efficacy of embeddings with a broader public
2. Implement Audio Word2Vec: train acoustic embeddings using a denoising autoencoder architecture (see the seq2seq sketch after this list)
   - Good opportunity to get acquainted with the LibriSpeech dataset
   - These embeddings could lend themselves well to comparison against semantic embeddings
3. Implement Speech2Vec (*A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech*): train semantic embeddings using an RNN encoder-decoder architecture (see the seq2seq sketch after this list)
   - This architecture, or one of similar capability, is one we want to leverage for unsupervised audio-to-audio translation
4. Reproduce #3 with a transformer architecture (tentative)
   - Transformers are easy to train and scale efficiently to vast amounts of data
5. Obtain or synthesize a bilingual speech dataset, train semantic word embeddings (using #3 or #4), and perform unsupervised translation (using #1)
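To make step #1 concrete, here is a minimal sketch of the Procrustes refinement step commonly used for unsupervised embedding alignment (as in MUSE-style translation without a rosetta stone), plus a nearest-neighbor lookup to translate a word. It assumes `src_emb` and `tgt_emb` are NumPy arrays of monolingual word embeddings and that a candidate dictionary `pairs` is already available; in the fully unsupervised setting that seed typically comes from adversarial training or CSLS matching. All names here are illustrative and not part of this repository.

```python
import numpy as np

def procrustes_alignment(src_emb, tgt_emb, pairs):
    """Closed-form orthogonal map W sending source embeddings onto the target space."""
    X = src_emb[[i for i, _ in pairs]]      # (n_pairs, dim) source vectors
    Y = tgt_emb[[j for _, j in pairs]]      # (n_pairs, dim) target vectors
    U, _, Vt = np.linalg.svd(Y.T @ X)       # solves min_W ||W X^T - Y^T||_F with W orthogonal
    return U @ Vt                           # (dim, dim)

def translate(word_idx, W, src_emb, tgt_emb, k=5):
    """Indices of the k nearest target words (cosine similarity) to a mapped source word."""
    mapped = W @ src_emb[word_idx]
    sims = (tgt_emb @ mapped) / (
        np.linalg.norm(tgt_emb, axis=1) * np.linalg.norm(mapped) + 1e-9
    )
    return np.argsort(-sims)[:k]

# Illustrative usage with random stand-ins for real monolingual embeddings.
rng = np.random.default_rng(0)
src_emb = rng.standard_normal((1000, 300))
tgt_emb = rng.standard_normal((1000, 300))
pairs = [(i, i) for i in range(200)]        # candidate dictionary (seed or self-learned)
W = procrustes_alignment(src_emb, tgt_emb, pairs)
print(translate(0, W, src_emb, tgt_emb))
```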
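For steps #2 and #3, the sketch below shows the shared RNN encoder-decoder idea, assuming PyTorch and word-level audio segments represented as sequences of MFCC frames. Trained as a denoising autoencoder it yields Audio Word2Vec-style acoustic embeddings; training the decoder to generate neighboring word segments instead (skipgram-style) yields Speech2Vec-style semantic embeddings. Class names, shapes, and hyperparameters are illustrative rather than taken from either paper's code.

```python
import torch
import torch.nn as nn

class SpeechSeq2Seq(nn.Module):
    """RNN encoder-decoder over MFCC frame sequences of shape (batch, time, n_mfcc)."""

    def __init__(self, n_mfcc=13, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(n_mfcc, hidden, batch_first=True)
        self.decoder = nn.GRU(n_mfcc, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mfcc)

    def embed(self, frames):
        # The final encoder state is the fixed-length embedding of the segment.
        _, h = self.encoder(frames)
        return h[-1]

    def forward(self, src_frames, tgt_frames):
        # Encode the source segment, then generate the target segment with
        # teacher forcing (target frames shifted right by one step).
        _, h = self.encoder(src_frames)
        dec_in = torch.cat(
            [torch.zeros_like(tgt_frames[:, :1]), tgt_frames[:, :-1]], dim=1
        )
        dec_out, _ = self.decoder(dec_in, h)
        return self.out(dec_out)

model = SpeechSeq2Seq()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Audio Word2Vec-style objective (#2): reconstruct the clean segment from a
# corrupted version of itself.
clean = torch.randn(8, 50, 13)              # stand-in for real MFCC segments
noisy = clean + 0.1 * torch.randn_like(clean)
ae_loss = nn.functional.mse_loss(model(noisy, clean), clean)

# Speech2Vec-style (skipgram) objective (#3): generate a neighboring word's
# segment from the center word's segment instead of reconstructing the input.
neighbor = torch.randn(8, 50, 13)           # stand-in for a context word's frames
s2v_loss = nn.functional.mse_loss(model(clean, neighbor), neighbor)

# In practice only one of the two objectives is optimized, depending on whether
# acoustic (#2) or semantic (#3) embeddings are wanted.
ae_loss.backward()
opt.step()
```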
For an overview of papers and other resources that inspire us and that we feel are instrumental to this work, please take a look at our bookshelf for this project here.