FlowSep: Language-Queried Sound Separation with Rectified Flow Matching
Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang
CVSSP, University of Surrey, Guildford, UK
Abstract

Language-queried audio source separation (LASS) aims to separate sound sources based on textual descriptions of the desired source. Current state-of-the-art (SoTA) models primarily use discriminative approaches, such as time-frequency masking, to separate the target sound while reducing interference from other sources. However, these models may struggle to separate overlapping sources, degrading overall performance in real-world scenarios. Recently, generative approaches such as diffusion-based models have successfully improved the subjective quality of separated sound. Rectified flow matching (RFM), a variant of diffusion-based generative models that establishes linear connections between the data and noise distributions, offers superior theoretical properties and simplicity, but has not yet been explored in sound separation. In this work, we introduce FlowSep, a novel generative model based on RFM for LASS tasks. Specifically, FlowSep learns linear flow trajectories from Gaussian noise to the target source features within a pre-trained latent space. During inference, the mel-spectrogram is reconstructed from the generated latent vector with the pre-trained variational autoencoder (VAE) decoder. Trained on 1,680 hours of audio data, FlowSep outperforms SoTA models across multiple benchmarks, as evaluated on subjective and objective metrics. Additionally, our results show that RFM surpasses diffusion-based approaches in both separation quality and inference efficiency, highlighting its strong potential for audio source separation tasks.
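To make the RFM idea concrete, below is a minimal training-objective sketch, assuming a conditional velocity-prediction network `v_theta(x_t, t, cond)`; the names here are illustrative placeholders, not FlowSep's actual interface. The model regresses the constant velocity of the straight line joining a Gaussian noise sample to a target latent.

```python
# Minimal sketch of a rectified flow matching objective, assuming a
# conditional velocity network v_theta(x_t, t, cond); all names are
# illustrative placeholders, not FlowSep's actual interface.
import torch

def rfm_loss(v_theta, x1, cond):
    """x1: target-source latents from the pre-trained VAE encoder, (B, ...);
    cond: conditioning features, e.g. T5 text embeddings (and mixture cues)."""
    x0 = torch.randn_like(x1)                      # Gaussian-noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform flow time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over latent dims
    x_t = (1 - t_) * x0 + t_ * x1                  # point on the straight path
    v_target = x1 - x0                             # constant velocity of that line
    return ((v_theta(x_t, t, cond) - v_target) ** 2).mean()
```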
FlowSep

FlowSep is an RFM-based generative model for text-based audio source separation. It comprises four main components: a T5 encoder for text embedding, a rectified flow matching (RFM) generator for creating audio features within the latent space, a VAE decoder for reconstructing the mel-spectrogram, and a GAN vocoder to produce the final waveform.
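As a rough illustration of how these four components fit together at inference time, the sketch below integrates the learned flow with a plain Euler solver and then decodes to audio; `t5`, `v_theta`, `vae_decoder`, and `vocoder` stand in for the components, and their real signatures in FlowSep may differ.

```python
# Hedged sketch of the inference pipeline: Euler-integrate the learned
# straight flow from noise to a source latent, then decode to audio.
# Component names and signatures are assumptions, not FlowSep's API.
import torch

@torch.no_grad()
def separate(query_text, mixture_latent, t5, v_theta, vae_decoder, vocoder,
             num_steps=25):
    cond = (t5(query_text), mixture_latent)    # text query + mixture conditioning
    x = torch.randn_like(mixture_latent)       # start from Gaussian noise (t = 0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)  # current flow time
        x = x + dt * v_theta(x, t, cond)       # one Euler step along the flow
    mel = vae_decoder(x)                       # latent -> mel-spectrogram
    return vocoder(mel)                        # mel-spectrogram -> waveform
```

With straight trajectories, a small `num_steps` already tracks the flow closely, which is what gives RFM its inference-efficiency edge over diffusion samplers.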
Demos on DCASE2024-Real
[Audio demo players]
Demos on AudioCaps
[Audio demo players] Text queries used in these demos:

- "a cat meowing and young female speaking"
- "A group of people laughing followed by farting"
- "a kid crying as a man and a woman talk followed by a car door opening then closing"
- "a woman is speaking from a microphone"
- "race cars are racing followed by people talking"
- "a man speaks with low speech in the background"
- "an engine revving and then tires squealing"
- "a chainsaw cutting as wood is cracking"
- "a man speaking as vehicles drive by and leaves rustling"
- "the sound of horn from a car approaching from a distance"
Demos on DCASE2024-Synth
[Audio demo players] Text queries used in these demos:

- "with the sound of the car horn, the car is speeding by"
- "a person is pressing the shutter button of the camera to check something"
Acknowledgement
This research was partly supported by a research scholarship from the China Scholarship Council (CSC), funded by British Broadcasting Corporation Research and Development (BBC R&D), Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 'AI for Sound', and a PhD scholarship from the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising.