===== Music source separation in the waveform domain =====
//Abstract//:
Source separation for music is the task of isolating contributions, or stems, from different instruments recorded individually and arranged together to form a song. Such
components include voice, bass, drums and any other accompaniments. Contrary
to many audio synthesis tasks where the best performances are achieved by models
that directly generate the waveform, the state-of-the-art in source separation for
music is to compute masks on the magnitude spectrum. In this paper, we compare
two waveform domain architectures. We first adapt Conv-Tasnet, initially developed
for speech source separation, to the task of music source separation. While Conv-Tasnet
beats many existing spectrogram-domain methods, it suffers from significant
artifacts, as shown by human evaluations. We propose instead Demucs, a novel
waveform-to-waveform model, with a U-Net structure and bidirectional LSTM.
Experiments on the MusDB dataset show that, with proper data augmentation, Demucs
beats all existing state-of-the-art architectures, including Conv-Tasnet, with 6.3
SDR on average (and up to 6.8 with 150 extra training songs, even surpassing the
IRM oracle for the bass source). Using recent developments in model quantization,
Demucs can be compressed down to 120MB without any loss of accuracy. We also
provide human evaluations, showing that Demucs has a large advantage
in terms of the naturalness of the audio. However, it suffers from some bleeding,
especially between the vocals and the "other" source.
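
The SDR figures quoted above are signal-to-distortion ratios in decibels, where higher is better. As a rough illustration of what the number measures, the sketch below computes a simplified SDR between a reference stem and an estimate. This is an assumption made for illustration: MusDB results are normally reported with the BSS Eval toolchain, which uses a more elaborate definition, and the function name `sdr_db` and the toy signals are hypothetical.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-10):
    """Simplified signal-to-distortion ratio in dB (not the full BSS Eval metric).

    Computes 10 * log10(||reference||^2 / ||reference - estimate||^2).
    `eps` guards against division by zero and log of zero.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))


# Toy example: a 55 Hz "bass" tone sampled at 44.1 kHz for one second.
rng = np.random.default_rng(0)
t = np.arange(44100) / 44100.0
bass = np.sin(2 * np.pi * 55 * t)

# Two hypothetical estimates: one with light noise, one with heavier noise.
good_estimate = bass + 0.05 * rng.standard_normal(t.shape)
poor_estimate = bass + 0.50 * rng.standard_normal(t.shape)

print(f"good estimate: {sdr_db(bass, good_estimate):.1f} dB")
print(f"poor estimate: {sdr_db(bass, poor_estimate):.1f} dB")
```

An estimate closer to the reference yields a higher value, so a gain of a fraction of a dB averaged over all sources corresponds to a measurable reduction in residual error energy.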
I'm happy to release the v3 of Demucs for Music Source Separation, with hybrid domain prediction, compressed residual branches and much more. Checkout the code:https://t.co/ptJ4IvXggU
— Alexandre Défossez (@honualx) November 10, 2021