Speech commands recognition with PyTorch | Kaggle 10th place solution in TensorFlow Speech Recognition Challenge

tugstugi/pytorch-speech-commands

Convolutional neural networks for the Google speech commands data set, implemented in PyTorch.

General

We, xuyuan and tugstugi, participated in the Kaggle competition TensorFlow Speech Recognition Challenge and finished in 10th place. This repository contains a simplified and cleaned-up version of our team's code.

Features

  • 1x32x32 mel-spectrogram as network input
  • a single network implementation for both the CIFAR10 and Google speech commands data sets
  • faster audio data augmentation on STFT
  • Kaggle private LB scores evaluated on 150,000+ audio files

Results

Because of the competition's time limit, we trained most of the networks with SGD and ReduceLROnPlateau for 70 epochs. For the training parameters and dependencies, see TRAINING.md. Stopping the training earlier sometimes produces a better Kaggle score.

| Model | CIFAR10 test set accuracy | Speech Commands test set accuracy | Speech Commands test set accuracy with crop | Speech Commands Kaggle private LB score | Speech Commands Kaggle private LB score with crop | Remarks |
|---|---|---|---|---|---|---|
| VGG19 BN | 93.56% | 97.337235% | 97.527432% | 0.87454 | 0.88030 | |
| ResNet32 | - | 96.181419% | 96.196050% | 0.87078 | 0.87419 | |
| WRN-28-10 | - | 97.937089% | 97.922458% | 0.88546 | 0.88699 | |
| WRN-28-10-dropout | 96.22% | 97.702999% | 97.717630% | 0.89580 | 0.89568 | |
| WRN-52-10 | - | 98.039503% | 97.980980% | 0.88159 | 0.88323 | another trained model has 97.52%/0.89322 |
| ResNext29 8x64 | - | 97.190929% | 97.161668% | 0.89533 | 0.89733 | our best model during competition |
| DPN92 | - | 97.190929% | 97.249451% | 0.89075 | 0.89286 | |
| DenseNet-BC (L=100, k=12) | 95.52% | 97.161668% | 97.147037% | 0.88946 | 0.89134 | |
| DenseNet-BC (L=190, k=40) | - | 97.117776% | 97.147037% | 0.89369 | 0.89521 | |
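The training schedule described above lowers the learning rate when the validation loss stops improving. The following is a minimal, dependency-free sketch of that rule for illustration only. It is not the code used in this repository and not PyTorch's implementation (in PyTorch the corresponding scheduler is `torch.optim.lr_scheduler.ReduceLROnPlateau`). The class name `PlateauLR` and the values of `factor`, `patience` and `min_lr` are assumptions; the actual training parameters are listed in TRAINING.md.

```python
class PlateauLR:
    """Simplified reduce-on-plateau rule (illustrative, hypothetical helper)."""

    def __init__(self, lr, factor=0.1, patience=5, min_lr=1e-6):
        self.lr = lr                  # current learning rate
        self.factor = factor          # multiplier applied on a plateau (assumed value)
        self.patience = patience      # non-improving epochs tolerated (assumed value)
        self.min_lr = min_lr          # lower bound for the learning rate
        self.best = float("inf")      # best validation loss seen so far
        self.bad_epochs = 0           # consecutive epochs without improvement

    def step(self, val_loss):
        """Update the learning rate from this epoch's validation loss."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Reduce only after more than `patience` epochs without improvement.
        if self.bad_epochs > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_epochs = 0
        return self.lr


# Made-up validation losses: improvement stalls after the second epoch.
scheduler = PlateauLR(lr=0.01, factor=0.1, patience=2)
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.9, 0.9, 0.9]]
```

In this sketch the learning rate stays at its initial value while the loss improves and drops by `factor` once the loss has failed to improve for more than `patience` epochs. Because stopping earlier sometimes gives a better Kaggle score, it can be worth keeping checkpoints from several epochs rather than only the last one.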

Results with Mixup

After the competition, some of the networks were retrained using mixup, as described in "mixup: Beyond Empirical Risk Minimization" by Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin and David Lopez-Paz.

| Model | CIFAR10 test set accuracy | Speech Commands test set accuracy | Speech Commands test set accuracy with crop | Speech Commands Kaggle private LB score | Speech Commands Kaggle private LB score with crop | Remarks |
|---|---|---|---|---|---|---|
| VGG19 BN | - | 97.483541% | 97.542063% | 0.89521 | 0.89839 | |
| WRN-52-10 | - | 97.454279% | 97.498171% | 0.90273 | 0.90355 | same score as the 16-th place in Kaggle |
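For readers unfamiliar with mixup, the idea from the paper is to train on convex combinations of pairs of examples and of their labels, with the mixing weight drawn from a Beta(alpha, alpha) distribution. The following NumPy sketch illustrates this on a batch shaped like the 1x32x32 mel-spectrogram input. It is an assumption-laden illustration, not this repository's training code: the function name `mixup_batch`, the value `alpha=0.2`, the batch size and the number of classes are all hypothetical.

```python
import numpy as np


def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Mix each example with a randomly chosen partner from the same batch.

    x:        array of shape (batch, ...) with the network inputs
    y_onehot: array of shape (batch, num_classes) with one-hot labels
    alpha:    Beta distribution parameter (assumed value, not from this repo)
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = float(rng.beta(alpha, alpha))      # mixing weight in [0, 1]
    perm = rng.permutation(x.shape[0])       # partner index for every example
    mixed_x = lam * x + (1.0 - lam) * x[perm]
    mixed_y = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return mixed_x, mixed_y, lam


# Toy batch: 4 random "spectrograms" of shape 1x32x32 and 3 made-up classes.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1, 32, 32)).astype(np.float32)
labels = np.array([0, 1, 2, 1])
y = np.eye(3, dtype=np.float32)[labels]

mixed_x, mixed_y, lam = mixup_batch(x, y, alpha=0.2, rng=rng)
```

The mixed labels are soft targets, so the loss must accept class probabilities rather than integer class indices. An equivalent formulation keeps the integer labels and weights the two per-example losses by `lam` and `1 - lam`.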
