FlowSep: Language-Queried Sound Separation with Rectified Flow Matching



Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang

CVSSP, University of Surrey, Guildford, UK


Paper | Code


Abstract

Language-queried audio source separation (LASS) aims to separate sound sources based on textual descriptions of the desired source. Current state-of-the-art (SoTA) models primarily use discriminative approaches, such as time-frequency masking, to separate the target sounds while reducing interference from other sources. However, these models may struggle to separate overlapping soundtracks, which degrades their performance in real-world scenarios. Recently, generative approaches, such as diffusion-based models, have successfully enhanced the subjective quality of separated sound. Rectified flow matching (RFM), a variant of diffusion-based generative models that establishes linear connections between the data distribution and noise, offers superior theoretical properties and simplicity, but has not yet been explored in sound separation. In this work, we introduce FlowSep, a novel generative model based on RFM for LASS tasks. Specifically, FlowSep learns linear flow trajectories from Gaussian noise to the target source features within a pre-trained latent space. During inference, the mel-spectrogram is reconstructed from the generated latent vector with the pre-trained variational autoencoder (VAE) decoder. Trained on 1,680 hours of audio data, FlowSep outperforms SoTA models across multiple benchmarks, as evaluated on subjective and objective metrics. Additionally, our results show that RFM surpasses diffusion-based approaches in both separation quality and inference efficiency, highlighting its strong potential for audio source separation tasks.

FlowSep

FlowSep is an RFM-based generative model for language-queried audio source separation. It comprises four main components: a T5 encoder that embeds the text query, a rectified flow matching (RFM) generator that produces audio features in the latent space, a VAE decoder that reconstructs the mel-spectrogram, and a GAN vocoder that produces the final waveform.

Figure 1: Framework of FlowSep.
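
To make the RFM idea concrete, below is a minimal PyTorch sketch of rectified flow matching in a latent space. It assumes a hypothetical conditional velocity network `velocity_model` taking the noisy latent, a time step, and a T5 text embedding; it illustrates the general technique, not FlowSep's released code.

```python
# Minimal sketch of rectified flow matching (RFM), assuming a hypothetical
# velocity-prediction network `velocity_model(z_t, t, text_emb)`.
import torch

def rfm_loss(velocity_model, z1, text_emb):
    """One training step: regress the straight-line velocity z1 - z0."""
    z0 = torch.randn_like(z1)                      # Gaussian noise endpoint
    t = torch.rand(z1.shape[0], device=z1.device)  # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (z1.dim() - 1)))       # broadcast over latent dims
    zt = (1.0 - t_) * z0 + t_ * z1                 # point on the linear path
    v_target = z1 - z0                             # constant velocity of the path
    v_pred = velocity_model(zt, t, text_emb)
    return torch.mean((v_pred - v_target) ** 2)

@torch.no_grad()
def rfm_sample(velocity_model, text_emb, shape, num_steps=25, device="cpu"):
    """Euler integration of the learned ODE from noise (t=0) to data (t=1)."""
    z = torch.randn(shape, device=device)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        z = z + velocity_model(z, t, text_emb) * dt
    return z  # latent to be decoded by the VAE decoder, then vocoded
```

Because the trajectories are straight lines, a small number of Euler steps can suffice at inference time, which is consistent with the efficiency advantage over standard diffusion samplers reported above.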



Demos on DCASE2024-Real

Each example pairs a text query with audio clips of the Mixture, the AudioSep result, and the FlowSep result.

"the crowd is cheering and giving applause"
"someone is beating the drum continuously"
"the woman is speaking in a distance"
"the woman is talking with others"
"a bunch of people are cheering in unison"
"the water is flowing and gurgling in the stream"
"a female is speaking and the cutlery is clanging"
"a cat is meowing amidst the faint sound of bird calls"
"in the forest, the birds are chirping incessantly"
"a car is passing by a noisy road"


Demos on AudioCaps

Each example pairs a text query with audio clips of the Mixture, the AudioSep result, the FlowSep result, and the Ground Truth source.

"a cat meowing and young female speaking"
"A group of people laughing followed by farting"
"a kid crying as a man and a woman talk followed by a car door opening then closing"
"a woman is speaking from a microphone"
"race cars are racing followed by people talking"
"a man speaks with low speech in the background"
"an engine revving and then tires squealing"
"a chainsaw cutting as wood is cracking"
"a man speaking as vehicles drive by and leaves rustling"
"the sound of horn from a car approaching from a distance"


Demos on DCASE2024-Synth

Each example pairs a text query with audio clips of the Mixture, the AudioSep result, the FlowSep result, and the Ground Truth source.

"with the sound of the car horn, the car is speeding by"
"a person is pressing the shutter button of the camera to check something"

Acknowledgement

This research was partly supported by a research scholarship from the China Scholarship Council (CSC), funded by British Broadcasting Corporation Research and Development (BBC R&D), by Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 'AI for Sound', and by a PhD scholarship from the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising.




Page updated on 10 Sep 2024