Separate Anything You Describe



Xubo Liu1, Qiuqiang Kong2, Yan Zhao2, Haohe Liu1, Yi Yuan1, Yuzhuo Liu2

Rui Xia2, Yuxuan Wang2, Mark D. Plumbley1, Wenwu Wang1

1CVSSP, University of Surrey, Guildford, UK

2Speech, Audio & Music Intelligence (SAMI), ByteDance


Paper                   Code


Abstract

Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. For the reproducibility of this work, we will release the source code, evaluation benchmark and pre-trained model.

AudioSep

AudioSep is a foundation model for open-domain sound separation with natural language queries. AudioSep has two key components: a text encoder and a separation model, as illustrated in Figure 1 below.

Figure 1: Framework of AudioSep.
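The two-component design (a text encoder producing a query embedding that conditions a separation model) can be illustrated with a toy sketch. This is not AudioSep's released API: every name, shape, and the hash-seeded "encoder" below are made-up stand-ins, and the conditioning is a simple sigmoid frequency mask rather than the actual separation network.

```python
import numpy as np

def text_encoder(query: str, dim: int = 8) -> np.ndarray:
    # Toy stand-in for a pretrained text encoder: deterministically
    # seed an RNG from the query characters and draw an embedding.
    seed = sum(ord(c) for c in query)
    return np.random.default_rng(seed).standard_normal(dim)

def separate(mixture_spec: np.ndarray, query: str) -> np.ndarray:
    # Condition on the query: project the text embedding to one gain
    # per frequency bin, squash to (0, 1) with a sigmoid, and apply
    # the resulting mask to every time frame of the mixture.
    emb = text_encoder(query)
    proj = np.random.default_rng(1).standard_normal((mixture_spec.shape[0], emb.size))
    mask = 1.0 / (1.0 + np.exp(-(proj @ emb)))
    return mask[:, None] * mixture_spec

# Toy magnitude spectrogram: 64 frequency bins x 100 time frames.
mixture = np.abs(np.random.default_rng(2).standard_normal((64, 100)))
separated = separate(mixture, "acoustic guitar")
print(separated.shape)  # (64, 100)
```

The point of the sketch is only the data flow: the query determines the mask, so different text queries extract different content from the same mixture.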



Musical Instrument Separation

[Audio demo table: for each text query below, the page plays three clips — the mixture, the separated audio, and the ground truth.]

Text queries: "accordion", "acoustic guitar", "cello", "erhu", "flute", "saxophone", "trumpet", "violin", "tuba"


Audio Event Separation

[Audio demo table: for each text query below, the page plays three clips — the mixture, the separated audio, and the ground truth.]

Text queries: "water drops", "keyboard typing", "laughing", "cat", "door wood knock"


Audio Event Separation (with AudioCaps captions)

[Audio demo table: for each caption query below, the page plays three clips — the mixture, the separated audio, and the ground truth.]

Caption queries:
"A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs."
"A series of burping."
"A ticktock sound playing at the same rhythm with piano."
"A man speaks then a small bird chirps."
"Footsteps and scuffing occur, after which a door grinds, squeaks and clicks, an adult male speaks, and the door grinds, squeaks and clicks shut."

Acknowledgement

This work is partly supported by UK Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 "AI for Sound", British Broadcasting Corporation Research and Development (BBC R&D), a PhD scholarship from the Centre for Vision, Speech and Signal Processing, Faculty of Engineering and Physical Science, University of Surrey, and a Research Scholarship from the China Scholarship Council (CSC). For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising.




Page updated on 9 August 2023