“Excuse me while I kiss this guy...” Jimi Hendrix’s lyrics to Purple Haze are frequently misheard, raising the question: if the lyrics of a song are not written down, how are we to know what they are? PhD student Emir Demirel, working with his supervisor Professor Simon Dixon, set out to solve this problem.
Around ninety percent of recorded songs do not have transcribed lyrics. Lyrics help listeners to understand and appreciate music. Accurate lyric transcription also helps composers to create sheet music of their songs from an audio recording.
Speech recognition software is widely used and transcribes the spoken word with a high degree of accuracy, but it performs poorly when tested on sung words. Can a high-quality Automatic Lyrics Transcription (ALT) system be developed, one that creates a text file as close as possible to the original words of the song, or to those captured by a human transcriber?
There are two main approaches to recognising and transcribing human speech:
The first, the hybrid approach, recognises phonemes, the basic building blocks of speech. All words consist of sequences of phonemes. The system “listens” to an audio file and maps it against the phonemes it knows, creating a number of phoneme possibilities for each sound. It then maps these onto candidate words and their probabilities using a pronunciation model.
Once all the probabilities are in place, the program builds a graph of the possible word alternatives. This graph is then decoded with an algorithm such as beam search, which keeps only the most likely partial sequences at each step.
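As a loose illustration of that decoding step (hypothetical code, not the researchers’ own), the sketch below runs a beam search over a toy word graph; the candidate words and probabilities are invented, echoing the misheard Hendrix line.

```python
import math

def beam_search(steps, beam_width=3):
    """Decode the most likely word sequence from a toy lattice.

    `steps` is a list; each element maps candidate words to their
    log-probabilities at that position (a stand-in for the graph
    produced by the acoustic, pronunciation and language models).
    """
    # Each hypothesis is (accumulated log-probability, word sequence).
    beams = [(0.0, [])]
    for candidates in steps:
        expanded = []
        for score, words in beams:
            for word, logp in candidates.items():
                expanded.append((score + logp, words + [word]))
        # Keep only the `beam_width` best partial hypotheses.
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_width]
    return beams[0]

# Invented lattice: the misheard Hendrix line as competing word hypotheses.
lattice = [
    {"excuse": math.log(0.9), "accuse": math.log(0.1)},
    {"me": math.log(1.0)},
    {"while": math.log(0.8), "why": math.log(0.2)},
    {"i": math.log(1.0)},
    {"kiss": math.log(0.95), "miss": math.log(0.05)},
    {"the": math.log(0.6), "this": math.log(0.4)},
    {"sky": math.log(0.7), "guy": math.log(0.3)},
]

score, words = beam_search(lattice)
print(" ".join(words), score)
```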
The second, the end-to-end approach, uses a single neural-network model instead. A neural network is a computer system inspired by the human brain and nervous system, which allows researchers to model the kind of information processing found in human perception and cognition. End-to-end models consist purely of neural networks and, given a training objective and a sufficient amount of training data, learn to map audio directly to text. In the context of speech recognition, they do not require a pronunciation model, so no prior human or linguistic expertise is needed.
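For illustration only, here is a minimal, hypothetical end-to-end model written with PyTorch and trained with a CTC objective, mapping audio features straight to characters without any pronunciation dictionary. It is a generic sketch, not the architecture used in this research.

```python
import torch
import torch.nn as nn

class EndToEndLyricsModel(nn.Module):
    """Hypothetical end-to-end model: audio features in, per-frame
    character probabilities out. Trained with CTC, so no lexicon
    or pronunciation model is needed."""

    def __init__(self, n_features=80, n_chars=30, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, num_layers=3,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_chars)

    def forward(self, features):
        # features: (batch, time, n_features), e.g. log-mel spectrograms
        encoded, _ = self.encoder(features)
        # Per-frame log-probabilities over the character vocabulary
        return self.classifier(encoded).log_softmax(dim=-1)

model = EndToEndLyricsModel()
ctc_loss = nn.CTCLoss(blank=0)

# One fake training example: 200 audio frames, a 25-character lyric line.
features = torch.randn(1, 200, 80)
targets = torch.randint(1, 30, (1, 25))
log_probs = model(features).transpose(0, 1)        # CTC expects (time, batch, chars)
loss = ctc_loss(log_probs, targets,
                input_lengths=torch.tensor([200]),
                target_lengths=torch.tensor([25]))
loss.backward()
```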
Demirel’s research received funding from Horizon 2020, the EU's funding programme for research and innovation. He set out to see how best to adapt state-of-the-art automatic speech recognition software to recognise singing data. He wanted to identify how sung lyrics differ from spoken words in terms of pronunciation. He would then use this knowledge to improve word recognition. His research focused on building a robust system that can operate well in varying acoustic conditions, such as different music styles, instruments or recording conditions.
There are existing lyrics transcription models, but the researchers noted a significant drop in effectiveness when monophonic (single-voice, a cappella) models are used to transcribe polyphonic recordings, where the singing is accompanied, and vice versa. Would it be possible to construct a single model that works across all possible domains?
On their first attempt, Demirel and his co-researchers used a state-of-the-art DNN-HMM speech recognition framework, trained on Stanford University’s Digital Archive of Mobile Performances (DAMP) dataset. This is the benchmark collection of monophonic recordings used in the study of lyrics transcription.
They added their own novel acoustic model and exploited neural networks to build language models. With these adaptations, they were able to reduce the word error rate by up to fifteen percent.
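Word error rate, the metric these improvements are reported in, counts the substitutions, insertions and deletions needed to turn the system’s output into the reference lyrics, divided by the number of reference words. A small, self-contained illustration:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: the minimum number of substitutions, insertions
    and deletions needed to turn the transcription into the reference,
    divided by the number of reference words (Levenshtein distance)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("excuse me while i kiss the sky",
                      "excuse me while i kiss this guy"))  # 2 errors / 7 words ≈ 0.29
```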
Unsurprisingly, when singing, people form longer vowel sounds. They also sometimes omit the final consonant in a word. Demirel embedded these observations into a pronunciation dictionary and created a singer-adapted lexicon which can be used specifically for lyrics transcription. This again led to a marginal but consistent improvement in the word error rate.
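As a hypothetical sketch of what a singer-adapted lexicon entry might look like (the actual dictionary format is not shown here), the snippet below adds two pronunciation variants per word: one with sustained vowels and one with the final consonant dropped. The phoneme symbols are a small subset of the CMU dictionary set.

```python
# A few CMU-dictionary-style vowel symbols, enough for this illustration.
VOWELS = {"AA", "AE", "AH", "AO", "EH", "ER", "IH", "IY", "OW", "UH", "UW"}

def singing_variants(word, phonemes):
    """Return a toy singer-adapted lexicon entry for one word."""
    variants = [phonemes]                       # the spoken baseline
    lengthened = []
    for p in phonemes:
        lengthened.append(p)
        if p in VOWELS:
            lengthened.append(p)                # model sustained vowels
    variants.append(lengthened)
    if phonemes[-1] not in VOWELS:
        variants.append(phonemes[:-1])          # drop the final consonant
    return {word: variants}

print(singing_variants("love", ["L", "AH", "V"]))
# {'love': [['L', 'AH', 'V'], ['L', 'AH', 'AH', 'V'], ['L', 'AH']]}
```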
The research led to the development of a new neural network architecture, Multistreaming Acoustic Modelling for Automatic Lyrics Transcription (MSTRE-Net), an innovative system that attempts to model the way the human ear processes auditory information.
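MSTRE-Net’s exact design is laid out in the team’s publications; purely as an illustration of the multistream idea, the hypothetical encoder below passes the same audio features through parallel streams with different temporal resolutions and merges them, loosely mirroring how the ear analyses sound at several time scales.

```python
import torch
import torch.nn as nn

class MultistreamEncoder(nn.Module):
    """Loose, hypothetical illustration of multistream acoustic modelling:
    parallel convolutional streams look at the same features with
    different temporal contexts, and their outputs are merged."""

    def __init__(self, n_features=80, channels=64):
        super().__init__()
        # Each stream uses a different kernel width (temporal resolution).
        self.streams = nn.ModuleList([
            nn.Conv1d(n_features, channels, kernel_size=k, padding=k // 2)
            for k in (3, 7, 15)
        ])
        self.merge = nn.Conv1d(3 * channels, channels, kernel_size=1)

    def forward(self, features):
        # features: (batch, n_features, time)
        outputs = [torch.relu(stream(features)) for stream in self.streams]
        return self.merge(torch.cat(outputs, dim=1))

encoder = MultistreamEncoder()
print(encoder(torch.randn(1, 80, 200)).shape)  # torch.Size([1, 64, 200])
```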
Previously, transcribers were trained either on a cappella singing or on polyphonic recordings, using one particular dataset: either the DAMP dataset mentioned above, or the DALI dataset from IRCAM in Paris, which contains more than 5,000 polyphonic recordings. Demirel merged both datasets and used them to train a single cross-domain model.
The researchers also taught the model to distinguish silent sections and instrumental sections (where non-vocal music is playing) from sung ones, which improved lyrics transcription performance.
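As a simplified, hypothetical illustration of that idea, the snippet below tags each audio frame as silence, instrumental or vocal so that only the vocal regions are passed to the transcriber; in the actual system this decision comes from the trained model, and the scores here are made up.

```python
def tag_frames(energies, vocal_scores, energy_floor=0.05, vocal_threshold=0.5):
    """Label each frame as silence, instrumental or vocal using toy
    per-frame energy and singing-likelihood scores."""
    tags = []
    for energy, vocal in zip(energies, vocal_scores):
        if energy < energy_floor:
            tags.append("silence")
        elif vocal >= vocal_threshold:
            tags.append("vocal")
        else:
            tags.append("instrumental")
    return tags

energies     = [0.01, 0.02, 0.40, 0.55, 0.60, 0.58, 0.03]
vocal_scores = [0.00, 0.00, 0.10, 0.80, 0.90, 0.85, 0.00]
print(tag_frames(energies, vocal_scores))
# ['silence', 'silence', 'instrumental', 'vocal', 'vocal', 'vocal', 'silence']
```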
Demirel and his co-researchers have presented five publications at peer-reviewed conferences, and their open-source software is available on GitHub.
The technology has also been turned into the first commercial application of lyrics transcription: ScoreCloud Songwriter, developed in collaboration with Doremir, allows composers to sing and play into a single microphone and receive a lead sheet with lyrics and chords.
The technology is also used in music practice software such as the Moises App, and it can transcribe lyrics for the thousands of songs released each week.
Demirel hopes to further leverage rhyming information and to extend the work to lyrics in languages other than English. He will also be challenging the system with more complex cases, such as the “brutal” lyrics of death metal, opera singing and custom words.