This tool converts Synthesia-like piano videos from YouTube into a MIDI (.mid) file that can be read by most music tools. It makes extensive use of OpenCV to process the video frames and extract the notes played.
Currently implemented:
- YoutubeStitch.py: Stitch together the frames of the YouTube video into a tall png image of the keys pressed.
- WaterfallProcess.py: Convert the stitched image to a list of lists of rectangles representing the keys pressed. Very buggy, not yet usable for note extraction.
Todo:
- Better processing of the notes, especially close ones. (Try pseudo-gradient descent again, but with a cost function expressed in absolute pixels, e.g. L = 10 * black_pixels - 1 * white_pixels, where the pixels counted are the ones enclosed by a proposed rectangle. Beware that close rectangles will mask each other, so L also needs an overlap term; see the sketch after this list.)
- Implement ReadNotes.py, which converts the processed rectangles to a list of notes.
- Fix YoutubeStitch.py so that the note timings are respected. (Find the average scroll rate and stitch blindly? The scroll rate might not be constant...)
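The cost function idea above could look roughly like the sketch below. This is only an illustration of the proposed approach, not code from the repository: the image is assumed to be a binarized stitched image where pressed keys appear white, and the weights and overlap penalty are placeholders.

```python
import numpy as np

def rectangle_cost(image, rect, other_rects, w_black=10.0, w_white=1.0, w_overlap=5.0):
    """Cost of fitting `rect` = (x, y, w, h) to a binarized stitched image.

    Assumes `image` is 0/255 with pressed keys as white pixels. Lower cost means
    the rectangle encloses mostly white (note) pixels, few black (background)
    pixels, and overlaps little with neighbouring rectangles.
    """
    x, y, w, h = rect
    patch = image[y:y + h, x:x + w]
    white = np.count_nonzero(patch)     # note pixels inside the rectangle
    black = patch.size - white          # background pixels inside the rectangle

    # Penalize overlap with other rectangles so that close notes do not mask each other.
    overlap = 0
    for ox, oy, ow, oh in other_rects:
        ix = max(0, min(x + w, ox + ow) - max(x, ox))
        iy = max(0, min(y + h, oy + oh) - max(y, oy))
        overlap += ix * iy

    return w_black * black - w_white * white + w_overlap * overlap
```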
This is the tool that stitches together the frames of the YouTube video into a tall png image of the keys pressed.
This is the result of the script for the first 30 seconds of Chopin - Ballade No. 1 played by Rousseau (https://www.youtube.com/watch?v=Zj_psrTUW_w). This project will consist of 3 parts:
- YoutubeStitch.py
- WaterfallProcess.py
- ReadNotes.py
Here is the usage for YoutubeStitch.py:
python YoutubeStitch.py <url> <height> <interval> <start> <stop>
- url (string): Url of the Youtube video
- height (int): Percentage of the video height to process, measured from the top (to ignore the hands of the player)
- interval (float): Interval in seconds between the frames, to allow the script to run faster
- start (float): Start position of the video, in seconds (to ignore intros)
- stop (float): Stop position of the video, in seconds
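For example, to process the Rousseau video above, keeping the top 60% of each frame and sampling a frame every 0.1 seconds between 0:05 and 0:35 (all parameter values here are illustrative, not recommendations):

python YoutubeStitch.py https://www.youtube.com/watch?v=Zj_psrTUW_w 60 0.1 5 35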
The process can run for several minutes. You can speed up the script by lowering the processed height, which also removes many of the visual artefacts that might remain at the end. However, reducing the height completely messes up the duration of long notes and long silences.
It produces a file called output.png, which is the stitched image of the video.
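A minimal sketch of the general idea behind the stitching is shown below. It is not the actual script: the YouTube download step is omitted (a local file path is used instead), and it assumes a constant scroll rate, which is exactly the simplification that distorts note timings.

```python
import cv2
import numpy as np

def stitch_waterfall(video_path, height_pct=60, interval_s=0.1, scroll_px=40):
    """Build a tall image by keeping only the newest rows of each sampled frame.

    Simplifying assumption: the waterfall scrolls by exactly `scroll_px` pixels
    between two sampled frames, so the top `scroll_px` rows of each frame are
    new content. A varying scroll rate breaks this and distorts note timings.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(round(fps * interval_s)))

    strips = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            crop_rows = frame.shape[0] * height_pct // 100
            region = frame[:crop_rows]          # keep the top of the frame, ignoring the pianist's hands
            strips.append(region[:scroll_px])   # newest waterfall content enters at the top
        index += 1
    cap.release()

    # Later strips contain later notes; stack them first so the image reads top-down in time.
    return np.vstack(strips[::-1])
```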
This is the tool that converts the stitched image to a list of lists of rectangles representing the keys pressed. Usage:
python WaterfallProcess.py <input_png> <output_mid>
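One common OpenCV approach to this kind of extraction is thresholding followed by contour detection. The sketch below illustrates that idea only and is not the current implementation; the filename, minimum area, and sorting are assumptions for the example.

```python
import cv2

def extract_note_rectangles(stitched_png, min_area=50):
    """Threshold the stitched image and return bounding rectangles (x, y, w, h)
    of the bright blobs, which correspond to pressed keys in the waterfall."""
    gray = cv2.imread(stitched_png, cv2.IMREAD_GRAYSCALE)
    # Otsu thresholding separates lit (pressed) keys from the dark background.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    rects = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) >= min_area]
    # Sort top-to-bottom so rectangles roughly follow playback order.
    return sorted(rects, key=lambda r: r[1])

rects = extract_note_rectangles("output.png")
print(f"found {len(rects)} candidate note rectangles")
```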
Here is the current state of the image processing done with OpenCV: