pip install -r requirements.txt
- Using Demucs to separate the vocals from the accompaniment in the original audio.
- Resampling the original audio to 16 kHz.
- Creating a new vocab dictionary for Wav2Vec2.
- Randomly selecting segments from the labels and merging them to create new audio/lyric pairs.
- Fine-tuning the Wav2Vec2 model with the original CTC loss on all training data, using the new vocab dictionary.
- Using forced alignment (dynamic programming) to find the best alignment path between the audio and the lyrics.
- Merging character durations to obtain word segment indices from the audio.
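The forced-alignment step above can be sketched as a dynamic-programming trellis over the model's frame-level outputs. This is a minimal illustration, not the repository's implementation: it assumes `emission` is a (frames × labels) matrix of log-probabilities from the fine-tuned Wav2Vec2 model, that label 0 is the CTC blank, and that each character is emitted in exactly one frame (real CTC alignment also allows a character to span repeated frames).

```python
import numpy as np

def forced_align(emission, tokens, blank=0):
    """Assign each transcript token to its most likely frame via dynamic programming.

    emission: (num_frames, num_labels) array of per-frame log-probabilities.
    tokens:   list of label indices for the transcript characters (no blanks).
    Returns a list with the frame index chosen for each token.
    """
    num_frames = emission.shape[0]
    num_tokens = len(tokens)

    # trellis[t, j]: best log-prob of having emitted the first j tokens
    # within the first t frames (all other frames emit blank).
    trellis = np.full((num_frames + 1, num_tokens + 1), -np.inf)
    trellis[0, 0] = 0.0
    for t in range(num_frames):
        for j in range(num_tokens + 1):
            stay = trellis[t, j] + emission[t, blank]  # frame t emits blank
            move = (trellis[t, j - 1] + emission[t, tokens[j - 1]]
                    if j > 0 else -np.inf)             # frame t emits token j-1
            trellis[t + 1, j] = max(stay, move)

    # Backtrack from the end to recover which frame emitted each token.
    frames = []
    j = num_tokens
    for t in range(num_frames, 0, -1):
        if j == 0:
            break
        stay = trellis[t - 1, j] + emission[t - 1, blank]
        move = trellis[t - 1, j - 1] + emission[t - 1, tokens[j - 1]]
        if move > stay:
            frames.append(t - 1)  # token j-1 was emitted at frame t-1
            j -= 1
    return frames[::-1]
```

In the full pipeline, the per-character frames produced this way would then be merged into word-level segments, as described in the last step above.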
Download the data here and prepare the dataset in the following format:
|- data/
|  |- public_test/
|  |  |- lyrics/
|  |  |- new_labels_json/
|  |  |- songs/
|  |- train/
|  |  |- labels/
|  |  |- songs/
sh reproduce.sh
You can also download and extract our checkpoints here to obtain the following layout:
|- checkpoints/
|  |- dragonSwing/
|  |  |- wav2vec2-base-vietnamese/
|  |  |  |- checkpoint-5500/
|  |  |  |  |- pytorch_model.bin
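Loading the downloaded checkpoint for inference could look roughly like the sketch below. The paths and the use of the standard Hugging Face `Wav2Vec2ForCTC` / `Wav2Vec2Processor` classes are assumptions based on the directory layout above, not a confirmed part of this repository.

```python
# Sketch only: assumes the checkpoint directory follows the Hugging Face
# format (config.json alongside pytorch_model.bin) and that the processor
# files live in the wav2vec2-base-vietnamese directory.
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

BASE = "checkpoints/dragonSwing/wav2vec2-base-vietnamese"  # hypothetical path

processor = Wav2Vec2Processor.from_pretrained(BASE)                 # tokenizer + feature extractor
model = Wav2Vec2ForCTC.from_pretrained(f"{BASE}/checkpoint-5500")   # fine-tuned weights
model.eval()
```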
python submission.py submission --saved_path ./result
zip -r submit.zip result/*.json