Separate Anything You Describe



Xubo Liu1, Qiuqiang Kong2, Yan Zhao2, Haohe Liu1, Yi Yuan1, Yuzhuo Liu2

Rui Xia2, Yuxuan Wang2, Mark D. Plumbley1, Wenwu Wang1

1CVSSP, University of Surrey, Guildford, UK

2Speech, Audio & Music Intelligence (SAMI), ByteDance


Paper                   Code


Abstract

Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. For the reproducibility of this work, we will release the source code, evaluation benchmark and pre-trained model.

AudioSep

AudioSep is a foundation model for open-domain sound separation with natural language queries. AudioSep has two key components: a text encoder and a separation model, as illustrated in Figure 1 below.

Figure 1: Framework of AudioSep.
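The two-component design (a text encoder producing a query embedding that conditions a separation model) can be illustrated with a toy sketch. This is not AudioSep's released API: every name, shape, and the hash-seeded "encoder" below are made-up stand-ins, and the conditioning is a simple sigmoid frequency mask rather than the actual separation network.

```python
import numpy as np

def text_encoder(query: str, dim: int = 8) -> np.ndarray:
    # Toy stand-in for a pretrained text encoder: deterministically
    # seed an RNG from the query characters and draw an embedding.
    seed = sum(ord(c) for c in query)
    return np.random.default_rng(seed).standard_normal(dim)

def separate(mixture_spec: np.ndarray, query: str) -> np.ndarray:
    # Condition on the query: project the text embedding to one gain
    # per frequency bin, squash to (0, 1) with a sigmoid, and apply
    # the resulting mask to every time frame of the mixture.
    emb = text_encoder(query)
    proj = np.random.default_rng(1).standard_normal((mixture_spec.shape[0], emb.size))
    mask = 1.0 / (1.0 + np.exp(-(proj @ emb)))
    return mask[:, None] * mixture_spec

# Toy magnitude spectrogram: 64 frequency bins x 100 time frames.
mixture = np.abs(np.random.default_rng(2).standard_normal((64, 100)))
separated = separate(mixture, "acoustic guitar")
print(separated.shape)  # (64, 100)
```

The point of the sketch is only the data flow: the query determines the mask, so different text queries extract different content from the same mixture.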



Musical Instrument Separation

[Audio demo table: for each text query below, the page plays three clips — the mixture, the separated audio, and the ground truth.]

Text queries: "accordion", "acoustic guitar", "cello", "erhu", "flute", "saxophone", "trumpet", "violin", "tuba"


Audio Event Separation

[Audio demo table: for each text query below, the page plays three clips — the mixture, the separated audio, and the ground truth.]

Text queries: "water drops", "keyboard typing", "laughing", "cat", "door wood knock"


Audio Event Separation (with AudioCaps captions)

[Audio demo table: for each caption query below, the page plays three clips — the mixture, the separated audio, and the ground truth.]

Caption queries:
"A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs."
"A series of burping."
"A ticktock sound playing at the same rhythm with piano."
"A man speaks then a small bird chirps."
"Footsteps and scuffing occur, after which a door grinds, squeaks and clicks, an adult male speaks, and the door grinds, squeaks and clicks shut."

Acknowledgement

This work is partly supported by UK Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 "AI for Sound", British Broadcasting Corporation Research and Development (BBC R&D), a PhD scholarship from the Centre for Vision, Speech and Signal Processing, Faculty of Engineering and Physical Science, University of Surrey, and a Research Scholarship from the China Scholarship Council (CSC). For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising.




Page updated on 9 August 2023