Automating Sequence to Sequence Piano Transcription With Transformers

Piano transcription is the task of converting an audio recording of a performance into a symbolic representation. Traditionally, this has involved detecting the onset of each note and converting those detections into a list of notation symbols, but more recently methods have been developed that convert raw audio directly into MIDI.
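
To make the target concrete, the sketch below shows one plausible symbolic representation: a list of notes with pitch, onset, offset, and velocity. The `Note` type and its field names are illustrative assumptions, not a format from any specific library.

```python
# A minimal, illustrative symbolic target for transcription (hypothetical type).
from dataclasses import dataclass

@dataclass
class Note:
    pitch: int      # MIDI pitch number, e.g. 60 = middle C
    onset: float    # time in seconds when the key is struck
    offset: float   # time in seconds when the note ends
    velocity: int   # MIDI velocity (loudness), 1-127

# A transcription system maps raw audio to a list of such notes:
# transcribe(audio) -> [Note(60, 0.50, 1.02, 80), Note(64, 0.51, 1.00, 72), ...]
```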

Many different approaches to piano transcription use recurrent neural networks (RNNs) to model the sequence of onset, note, and pedal events that make up a performance [1, 2, 3]. RNNs have been shown to perform well on speech transcription tasks, but their behaviour on piano is comparatively under-studied. Some studies focus on modelling the acoustic structure of the piano, while others are concerned with decoding into MIDI (the onset, note, and pedal events of the performance).

One challenge is that MIDI is an event-oriented format: a note's duration is not stored directly, but is implied by the gap between its note-on and note-off messages, so a single note can take on several durations depending on how it is played. This can cause confusion when automatically transcribing an audio sequence into a MIDI file [4]. A few recent papers have addressed this issue by introducing an early-stopping mechanism that reduces the probability of initiating a note if it is predicted to occur too soon. This improves the accuracy of the decoding process, but does not fully resolve the timing misalignment between the MIDI output and the original audio.
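
The pairing logic is easy to see in code. The sketch below is a simplified stand-in for a real MIDI parser: actual MIDI files store delta times in ticks and often encode a note-off as a note-on with velocity 0, both of which are elided here.

```python
# Sketch: a note's duration only emerges by pairing each note-on event with
# the next note-off of the same pitch (simplified; not a real MIDI parser).

def pair_note_events(events):
    """events: (time_sec, kind, pitch) tuples sorted by time, kind in {'on', 'off'}.
    Returns (pitch, onset_sec, offset_sec) triples."""
    active = {}  # pitch -> onset time of the currently sounding note
    notes = []
    for time, kind, pitch in events:
        if kind == 'on':
            active[pitch] = time
        elif kind == 'off' and pitch in active:
            notes.append((pitch, active.pop(pitch), time))
    return notes

events = [(0.50, 'on', 60), (0.51, 'on', 64), (1.00, 'off', 64), (1.02, 'off', 60)]
print(pair_note_events(events))  # [(64, 0.51, 1.0), (60, 0.5, 1.02)]
```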

Another challenge is that piano transcription models typically require a large number of parameters, and the resulting models can be difficult to interpret. Attempts have been made to use more interpretable architectures, including WaveNet and attention-based models [5, 6, 7, 8, 9]. Unfortunately, such models typically perform worse than the simple LSTM architectures used in the majority of studies on this task.

Recently, research has focused on addressing these problems of timing and interpretability by developing encoder-decoder models that translate raw audio into a sequence of MIDI events. A number of these models have been trained on the MAPS and MAESTRO datasets [10, 11, 13, 14, 15, 17, 19, 20, 21, 23]. However, some of these systems struggle in the presence of ambiguous onset and offset labels, and/or have difficulty detecting the correct MIDI pedal events.
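
A typical design choice in such encoder-decoder systems is a flat vocabulary of MIDI-like event tokens. The sketch below shows one plausible tokenization (time-shift, note-on, and note-off tokens at 10 ms resolution); the exact token set, ordering, and resolution here are assumptions for illustration, not a published vocabulary.

```python
# Hypothetical MIDI-like token vocabulary for a sequence-to-sequence decoder.
TIME_STEPS = 100   # assumed 10 ms resolution; one second spans 100 steps
NUM_PITCHES = 128
TIME_BASE, ON_BASE, OFF_BASE = 0, TIME_STEPS, TIME_STEPS + NUM_PITCHES
EOS = OFF_BASE + NUM_PITCHES  # end-of-sequence token

def notes_to_tokens(notes):
    """notes: (pitch, onset_sec, offset_sec) triples -> flat token sequence."""
    events = sorted([(on, ON_BASE + p) for p, on, _ in notes] +
                    [(off, OFF_BASE + p) for p, _, off in notes])
    tokens, now = [], 0.0
    for t, tok in events:
        shift = round((t - now) * TIME_STEPS)  # whole 10 ms steps since last event
        while shift > 0:                       # cap each time-shift token's range
            step = min(shift, TIME_STEPS - 1)
            tokens.append(TIME_BASE + step)
            shift -= step
        now = t
        tokens.append(tok)
    return tokens + [EOS]

print(notes_to_tokens([(60, 0.50, 1.02)]))  # [50, 160, 52, 288, 356]
```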

This article introduces a sequence-to-sequence piano transcription system that uses Transformers and can operate on raw audio without an extra time-to-MIDI conversion stage. The system learns a hierarchical representation of the sequence of onsets and notes in the raw audio, and uses it to generate a MIDI file that is both accurate and interpretable. The method also suggests that no specialized attention mechanism beyond the standard Transformer is needed for the task, and that simple data augmentation techniques can achieve comparable results.
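
The overall shape of such a system can be sketched with an off-the-shelf Transformer: spectrogram frames go into the encoder, and the decoder autoregressively emits event tokens. This is a minimal sketch assuming a 229-bin mel spectrogram input and the 357-token vocabulary from the earlier sketch; the dimensions and layer counts are illustrative, and positional encodings are omitted for brevity, so this is not the authors' exact configuration.

```python
# Minimal PyTorch sketch of a spectrogram-in, MIDI-tokens-out Transformer.
# Hyperparameters are illustrative; positional encodings omitted for brevity.
import torch
import torch.nn as nn

class Seq2SeqTranscriber(nn.Module):
    def __init__(self, n_mels=229, d_model=512, vocab_size=357):
        super().__init__()
        self.frame_proj = nn.Linear(n_mels, d_model)   # embed spectrogram frames
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)      # logits over event tokens

    def forward(self, mel_frames, prev_tokens):
        # mel_frames: (batch, n_frames, n_mels); prev_tokens: (batch, n_tokens)
        src = self.frame_proj(mel_frames)
        tgt = self.token_emb(prev_tokens)
        causal = self.transformer.generate_square_subsequent_mask(prev_tokens.size(1))
        return self.out(self.transformer(src, tgt, tgt_mask=causal))

model = Seq2SeqTranscriber()
logits = model(torch.randn(2, 512, 229), torch.randint(0, 357, (2, 64)))
print(logits.shape)  # torch.Size([2, 64, 357])
```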

The results of this work show that a high-resolution Transformer model can achieve performance comparable to state-of-the-art systems on the MAESTRO and MAPS datasets, and that it can handle the ambiguities introduced by differences in how note onsets and offsets are annotated in each dataset.
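
One way such annotation differences are usually absorbed at evaluation time is tolerance-based note matching, commonly with an onset window of around 50 ms. The greedy matcher below is an illustrative simplification of that idea, not the evaluation code used in this work.

```python
# Illustrative onset-F1 scorer with a +/-50 ms tolerance window (simplified,
# greedy one-to-one matching; not the metric implementation used here).

def onset_f1(ref, est, tol=0.05):
    """ref, est: lists of (pitch, onset_sec) pairs."""
    unmatched = list(ref)
    hits = 0
    for pitch, onset in est:
        for cand in unmatched:
            if cand[0] == pitch and abs(cand[1] - onset) <= tol:
                unmatched.remove(cand)
                hits += 1
                break
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if hits else 0.0

print(onset_f1([(60, 0.50), (64, 0.51)], [(60, 0.52), (64, 0.60)]))  # 0.5
```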