Separate Anything You Describe
Xubo Liu¹, Qiuqiang Kong², Yan Zhao², Haohe Liu¹, Yi Yuan¹, Yuzhuo Liu²,
Rui Xia², Yuxuan Wang², Mark D. Plumbley¹, Wenwu Wang¹
¹CVSSP, University of Surrey, Guildford, UK
²Speech, Audio & Music Intelligence (SAMI), ByteDance
Abstract
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. For the reproducibility of this work, we will release the source code, evaluation benchmark and pre-trained model.
AudioSep
AudioSep is a foundation model for open-domain sound separation with natural language queries. AudioSep has two key components: a text encoder and a separation model, as illustrated in Figure 1.
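The two-component design described above can be sketched in a few lines. This is a toy illustration under loud assumptions, not AudioSep's implementation: `encode_text`, `ToySeparator`, the hash-based embedding, and the frequency-mask conditioning are all hypothetical stand-ins chosen only to show how a text embedding can steer a separation model. The weights are random and untrained, so the output is not a meaningful separation.

```python
import hashlib

import numpy as np

EMBED_DIM = 16  # hypothetical embedding size, chosen for illustration


def encode_text(query: str, dim: int = EMBED_DIM) -> np.ndarray:
    """Stand-in text encoder: maps a query to a deterministic unit vector.

    A real system would use a pretrained language or audio-language model.
    """
    digest = hashlib.sha256(query.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "little")
    vec = np.random.default_rng(seed).standard_normal(dim)
    return vec / np.linalg.norm(vec)


class ToySeparator:
    """Stand-in separation model conditioned on a text embedding.

    It maps the embedding to a per-frequency mask in (0, 1) and applies
    that mask to the spectrum of the mixture (an assumed design).
    """

    def __init__(self, n_bins: int, embed_dim: int = EMBED_DIM, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.weight = rng.standard_normal((n_bins, embed_dim))  # untrained

    def separate(self, mixture: np.ndarray, text_embedding: np.ndarray) -> np.ndarray:
        spectrum = np.fft.rfft(mixture)
        logits = self.weight @ text_embedding
        mask = 1.0 / (1.0 + np.exp(-logits))  # sigmoid keeps the mask in (0, 1)
        return np.fft.irfft(spectrum * mask, n=len(mixture))


def separate_with_query(mixture: np.ndarray, query: str, separator: ToySeparator) -> np.ndarray:
    """Full pipeline: text query -> embedding -> conditioned separation."""
    return separator.separate(mixture, encode_text(query))


# One second of a synthetic two-tone mixture at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
mixture = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

separator = ToySeparator(n_bins=len(mixture) // 2 + 1)
estimate = separate_with_query(mixture, "violin", separator)
print(estimate.shape)  # (16000,)
```

The point of the sketch is the interface: the mixture and the query enter separately, and the query influences the output only through its embedding, so swapping the query changes which parts of the mixture are kept without changing the separator itself.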
Musical Instrument Separation
Text queries:
"acoustic guitar"
"cello"
"erhu"
"flute"
"saxophone"
"trumpet"
"violin"
"tuba"
Audio Event Separation
Text queries:
"water drops"
"keyboard typing"
"laughing"
"cat"
"door wood knock"
Audio Event Separation (w. AudioCaps caption)
Text queries:
"A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs."
"A series of burping."
"A ticktock sound playing at the same rhythm with piano."
"A man speaks then a small bird chirps."
"Footsteps and scuffing occur, after which a door grinds, squeaks and clicks, an adult male speaks, and the door grinds, squeaks and clicks shut."
Separation on Real Audio (w. FreeSound)
Text queries:
"The bell is ringing and emitting faint echoes."
"The fire crackles in the wind, creating a rhythmic and soothing sound."
"Something is falling to the ground and making a clank."
"In the forest, the birds are chirping incessantly."
Acknowledgement
This work is partly supported by the UK Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 “AI for Sound”, British Broadcasting Corporation Research and Development (BBC R&D), a PhD scholarship from the Centre for Vision, Speech and Signal Processing, Faculty of Engineering and Physical Sciences, University of Surrey, and a Research Scholarship from the China Scholarship Council (CSC). For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising
Page updated on 9 August 2023