This project was developed for my Speech Processing course at my university. The model architecture is inspired by DeepSpeech2.
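As a rough illustration, here is a minimal PyTorch sketch of a DeepSpeech2-style network: 2D convolutions over the spectrogram, stacked bidirectional GRUs, and a linear head that emits per-frame log-probabilities for CTC. The class name, layer sizes, and hyperparameters below are hypothetical and not taken from this repo.

```python
import torch
import torch.nn as nn

class DeepSpeech2Like(nn.Module):
    """Minimal DeepSpeech2-style sketch (hypothetical sizes, not this repo's
    exact model): conv frontend -> bidirectional GRU stack -> CTC head."""

    def __init__(self, n_mels=80, n_classes=29, hidden=512):
        super().__init__()
        # Two 2D convolutions over (frequency, time); with these kernel/stride/
        # padding choices the frequency axis shrinks by 4x (assumes n_mels % 4 == 0)
        # and the time axis by 2x.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5)),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5)),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(32 * (n_mels // 4), hidden, num_layers=3,
                          bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, spec):                            # spec: (batch, 1, n_mels, time)
        x = self.conv(spec)                             # (batch, 32, n_mels/4, time/2)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (batch, time, features)
        x, _ = self.rnn(x)
        return self.head(x).log_softmax(-1)             # per-frame log-probs for CTC

model = DeepSpeech2Like()
logp = model(torch.randn(4, 1, 80, 200))                # -> shape (4, 100, 29)
```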
I highly recommend using a conda virtual environment. I implemented this model with PyTorch and PyTorch Lightning.
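For example, a typical setup might look like this (`speech` is just a placeholder environment name, and the Python version is an assumption):

conda create -n speech python=3.9

conda activate speech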
pip install -r requirements.txt
The dataset used for training and evaluating this model was recorded and cleaned by me and my teammates. It contains 3,800 WAV files covering the 18 commands below:
python train.py --epoch [number of epochs] --batch_size [batch size] --data [path to data directory] --vocab [path to vocab model file] --mode [decode mode: 'greedy' or 'beam']
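For example, a hypothetical run (every path and value below is a placeholder, not a setting from this repo):

python train.py --epoch 50 --batch_size 32 --data ./data/commands --vocab ./vocab.model --mode greedy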
I used CTC as the loss function. For decoding, there are two strategies: a greedy decoder or a beam-search decoder.
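To illustrate the greedy strategy, here is a minimal sketch of standard greedy CTC decoding (not necessarily the exact code in this repo; it assumes the blank token has index 0): take the argmax class at each frame, collapse consecutive repeats, then drop blanks.

```python
import torch

def greedy_ctc_decode(log_probs, blank=0):
    """Greedy CTC decode: argmax per frame, collapse repeats, drop blanks.
    `log_probs` is a (time, n_classes) tensor of per-frame log-probabilities."""
    best = log_probs.argmax(dim=-1).tolist()  # most likely class at each frame
    decoded, prev = [], None
    for label in best:
        if label != blank and label != prev:  # skip blanks and repeated labels
            decoded.append(label)
        prev = label
    return decoded

# Toy example: per-frame argmaxes [1, 1, 0, 1, 2, 2] collapse to [1, 1, 2]
# (the blank between the two 1s keeps them as separate symbols).
frames = torch.eye(3)[[1, 1, 0, 1, 2, 2]].log_softmax(-1)
print(greedy_ctc_decode(frames))  # [1, 1, 2]
```

Beam search keeps the top-k partial hypotheses per frame instead of a single argmax path, which usually improves accuracy at the cost of decoding speed.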