paper | year | synopsis | results | architecture |
---|---|---|---|---|
Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder | 2016 | * from audio segments of variable length to embeddings * unsupervised, forced alignment of text to audio * training on MFCC features | * vector representations are shown to describe the sequential phonetic structures of the audio segments to a good degree | * RNN, denoising AE |
Word Embeddings for Speech Recognition | NA | * proprietary dataset * strong ties to speech recognition * interesting architecture / way of handling data (fixed-length representation of words) | * It can be seen that, as expected, neighbors of any given word arguably sound like it | * The input for the baseline network is 26 contiguous frames (20 on the left and 5 on the right to keep the latency low) of 40-dimensional log-filterbank features [8]. The log-filterbanks are computed every 10 ms over a 25 ms window. The network consists of eight fully connected rectified linear unit layers (so-called ReLUs) with 2560 nodes each, and a softmax layer on top with the 14000 states as the output labels. * longer words are truncated to 2 sec and shorter words are zero-padded on both ends * see the baseline-network sketch after the table |
DEEP CONVOLUTIONAL ACOUSTIC WORD EMBEDDINGS USING WORD-PAIR SIDE INFORMATION | 2016 | * siamese networks for creating embeddings * trained on telling whether given audio represents the same word or not (discrimination task) | * losses based on cosine similarity outperformed Euclidean-based losses | * Word classifier CNN: 1-D convolution with 96 filters over 9 frames; ReLU; max pooling over 3 units; 1-D convolution with 96 filters over 8 units; ReLU; max pooling over 3 units; 1024-unit fully-connected ReLU; softmax layer over 1061 word types. * Word similarity Siamese CNN: two convolutional and max pooling layers as above; 2048-unit fully-connected ReLU; 1024-unit fully-connected linear layer; terminates in loss l(x1, x2). * see the Siamese CNN sketch after the table |
Audio-Linguistic Embeddings for Spoken Sentences | 2019 | * spoken sentence embeddings which capture both acoustic and linguistic content * good discussion of the pros of sentence-level embeddings * multi-task learning (both an acoustic and a linguistic decoder) * RNNs are harder to train than TCNs | * spoken sentence embeddings outperform phoneme- and word-level baselines on speech recognition and emotion recognition tasks | * temporal convolution network (any causal model will work, for example transformers) |
WAV2VEC: UNSUPERVISED PRE-TRAINING FOR SPEECH RECOGNITION | 2019 | * focus is pretraining, the model is optimized to solve a next-time-step prediction task * fully unsupervised, in the process creates an embedding layer * code available as part of fairseq | * pretraining in the audio domain seems to work, here specifically for word recognition and letter recognition tasks | |
Neural Discrete Representation Learning | 2018 | * VAE with a discrete representation for latent structure learning * successfully trained across multiple modalities (images, audio, video) | * For instance, when trained on speech we discover the latent structure of language without any supervision or prior knowledge about phonemes or words | * see the vector-quantizer sketch after the table |
Representation Learning with Contrastive Predictive Coding | 2019 | * LibriSpeech processed using Kaldi, dataset available on Google Drive * predictive coding - learning latent representations via predicting future states * not sure I fully understand - they are backpropagating the loss from predicted future latent vectors? (see the InfoNCE sketch after the table) | | |
Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders | 2019 | * bidirectional Transformer encoder pre-trained on a large amount of unlabeled speech | * unsupervised pretraining for transformers is strikingly effective! ("In a low resource setting with only 0.1% of labeled data, we outperform the result of Mel-features that uses all 100% labeled data.") | * "Previous speech representation methods learn through conditioning on past frames and predicting information about future frames. Whereas Mockingjay is designed to predict the current frame through jointly conditioning on both past and future contexts." * see the masked-reconstruction sketch after the table |
LEARNING HIERARCHICAL DISCRETE LINGUISTIC UNITS FROM VISUALLY-GROUNDED SPEECH | 2019 | * very neat architecture for learning the meaning of word-like and subword-like segments (depending on where the quantization layer is inserted) * PyTorch code available on GitHub | | |
Language Transfer of Audio Word2Vec: Learning Audio Segment Representations without Target Language Data | 2017 | * an evaluation of the feasibility of transfer learning with Audio Word2Vec | * Audio Word2Vec lends itself well to transfer learning across languages - this raises concerns, and further suggests these are acoustic embeddings that do not capture semantic similarity well | |
Learning Word Embeddings from Speech | 2017 | * very elegant architecture for learning semantically meaningful word embeddings from audio! | * The biggest advantage of the proposed model is its capability of extracting semantic information of audio segments taken directly from raw speech, without relying on any other modalities such as text or images, which are challenging and expensive to collect and annotate. | |
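
The baseline network quoted for Word Embeddings for Speech Recognition maps 26 stacked frames of 40-dimensional log-filterbanks through eight 2560-unit ReLU layers to a softmax over 14000 states. A minimal PyTorch sketch, assuming the frames are simply flattened into one input vector (the class name and default arguments are mine, not the paper's):

```python
import torch
import torch.nn as nn

class BaselineDNN(nn.Module):
    """Hypothetical sketch: 26 frames x 40 log-filterbanks -> 8 fully
    connected ReLU layers of 2560 units -> logits over 14000 states."""
    def __init__(self, n_frames=26, n_mels=40, hidden=2560, n_layers=8, n_states=14000):
        super().__init__()
        layers, in_dim = [], n_frames * n_mels
        for _ in range(n_layers):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, n_states))  # softmax is applied in the loss
        self.net = nn.Sequential(*layers)

    def forward(self, frames):                 # frames: (batch, 26, 40)
        return self.net(frames.flatten(1))     # (batch, 14000) state logits

logits = BaselineDNN()(torch.randn(4, 26, 40))  # usage check
```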
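
A sketch of one branch of the word-similarity Siamese CNN, with layer sizes taken from the quoted description; the paper's exact pair loss l(x1, x2) is not reproduced in the notes, so the cosine-based hinge loss below is an assumed stand-in that only illustrates why cosine-similarity losses fit this setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseBranch(nn.Module):
    """One branch of the word-similarity Siamese CNN; layer sizes follow the
    quoted description, input assumed to be (batch, n_features, n_frames)."""
    def __init__(self, n_features=40):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 96, kernel_size=9), nn.ReLU(), nn.MaxPool1d(3),
            nn.Conv1d(96, 96, kernel_size=8), nn.ReLU(), nn.MaxPool1d(3),
        )
        self.fc = nn.Sequential(
            nn.LazyLinear(2048), nn.ReLU(),  # 2048-unit fully connected ReLU
            nn.Linear(2048, 1024),           # 1024-unit linear embedding
        )

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

def cosine_pair_loss(e1, e2, same, margin=0.5):
    """Assumed cosine-based pair loss (not the paper's exact l(x1, x2)):
    same-word pairs are pulled toward similarity 1, different-word pairs
    are pushed below the margin."""
    sim = F.cosine_similarity(e1, e2)
    return torch.where(same.bool(), 1.0 - sim, F.relu(sim - margin)).mean()
```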
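
The discrete bottleneck in Neural Discrete Representation Learning (VQ-VAE) is a nearest-codebook lookup trained with a straight-through gradient plus codebook and commitment terms. A minimal PyTorch sketch of that quantization layer (codebook size and dimensions are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Sketch of the VQ-VAE discrete bottleneck: nearest-codebook lookup,
    straight-through gradient, codebook + commitment losses."""
    def __init__(self, n_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)
        self.beta = beta

    def forward(self, z_e):                                          # z_e: (batch, T, dim)
        dists = torch.cdist(z_e, self.codebook.weight.unsqueeze(0))  # (batch, T, n_codes)
        idx = dists.argmin(-1)                                       # nearest code per step
        z_q = self.codebook(idx)                                     # quantized latents
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()                             # straight-through estimator
        return z_q, idx, loss
```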
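
On the "backpropagating the loss from predicted future latent vectors" question for Contrastive Predictive Coding: the context vector c_t is used to predict the encoder latent z_{t+k}, and the InfoNCE loss scores the true future latent against negatives; the gradient of that loss flows into the predictor, the autoregressive context network, and the encoder. A minimal PyTorch sketch, assuming negatives are simply the other future latents in the batch (a common simplification, not necessarily the paper's exact sampling scheme):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce_loss(z, c, predictor, k):
    """CPC-style contrastive loss for prediction step k (sketch).
    z: (batch, T, dim) encoder latents, c: (batch, T, dim) context vectors,
    predictor: linear map that turns c_t into a prediction of z_{t+k}.
    The positive for each prediction is the true latent k steps ahead;
    every other future latent in the batch serves as a negative."""
    pred = predictor(c[:, :-k]).reshape(-1, z.size(-1))    # (N, dim) predicted futures
    target = z[:, k:].reshape(-1, z.size(-1))              # (N, dim) actual futures
    logits = pred @ target.t()                             # similarity of every pair
    labels = torch.arange(logits.size(0), device=logits.device)
    # cross-entropy over the batch: gradients flow into the predictor,
    # the context network (through c) and the encoder (through z)
    return F.cross_entropy(logits, labels)

# usage sketch: z would come from a strided conv encoder, c from a GRU over z
z = torch.randn(8, 100, 256, requires_grad=True)
c = torch.randn(8, 100, 256, requires_grad=True)
loss = info_nce_loss(z, c, nn.Linear(256, 256, bias=False), k=3)
```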
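
Mockingjay's pretraining objective predicts the current frame from both past and future context: a fraction of input frames is masked and a bidirectional Transformer encoder is trained to reconstruct them. A minimal PyTorch sketch; mask probability, layer sizes, and plain zero-masking are placeholders rather than the paper's exact settings:

```python
import torch
import torch.nn as nn

class MaskedFrameModel(nn.Module):
    """Mockingjay-style sketch: zero out random frames and reconstruct them
    with a bidirectional Transformer encoder (L1 loss on masked positions).
    Positional encodings are omitted for brevity."""
    def __init__(self, n_mels=80, d_model=768, n_heads=12, n_layers=3, mask_prob=0.15):
        super().__init__()
        self.mask_prob = mask_prob
        self.in_proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, n_mels)

    def forward(self, frames):                                    # frames: (batch, T, n_mels)
        mask = torch.rand(frames.shape[:2], device=frames.device) < self.mask_prob
        corrupted = frames.masked_fill(mask.unsqueeze(-1), 0.0)   # hide the selected frames
        recon = self.out_proj(self.encoder(self.in_proj(corrupted)))
        loss = (recon - frames).abs()[mask].mean()                # L1 on masked frames only
        return loss, recon
```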