
Cleaning up audio usually means scrubbing timelines and tweaking filters, but Meta thinks it should be as easy as describing the sound you want. The company has released a new open-source AI model called SAM Audio that can isolate almost any sound from a complex recording using simple text prompts.
Users can pull out specific noises like voices, instruments, or background sounds without digging through complicated editing software. The model is now available through Meta’s Segment Anything Playground, which also houses the company’s other prompt-based image and video editing tools.
Broadly speaking, SAM Audio is designed to understand what sound you want to work with and separate it cleanly from everything else. Meta says this opens the door to faster audio editing for use cases like music production, podcasting, film and television, accessibility tools, and research.
For example, a creator could isolate vocals from a band recording, remove traffic noise from a podcast, or delete a barking dog from an otherwise perfect recording, all by describing what they want the model to target.
How SAM Audio works
SAM Audio is a multimodal model that supports three different types of prompts. Users can describe a sound using text, click on a person or object in a video to visually identify the sound they want to isolate, or mark a time span where the sound first appears. These prompts can be used alone or combined, giving users fine-grained control over what gets separated.
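To make the multi-prompt workflow concrete, here is a minimal sketch of what combining a text prompt with a time-span prompt might look like in code. Meta has not published this interface; the `sam_audio` package, the `SamAudio` class, and every method and checkpoint name below are hypothetical stand-ins used purely for illustration.

```python
# Hypothetical sketch of a combined text + time-span prompt.
# The sam_audio package, SamAudio class, and all names below are
# illustrative assumptions, not Meta's actual released API.
from sam_audio import SamAudio  # hypothetical package

model = SamAudio.from_pretrained("sam-audio-base")  # hypothetical checkpoint

# Load a mixed recording (e.g., a podcast with traffic noise).
mix = model.load_audio("podcast_raw.wav")

# Prompts can be used alone or combined for finer control: here,
# a text description of the target sound plus the time span
# (in seconds) where that sound first appears.
isolated = model.separate(
    mix,
    text="car traffic in the background",
    time_span=(12.0, 18.5),
)

# Subtract the isolated track from the mix to get a cleaned version.
cleaned = mix - isolated
model.save_audio(cleaned, "podcast_clean.wav")
```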
Under the hood, the system relies on Meta’s Perception Encoder Audiovisual engine, which gives the model the ability to recognize and understand sounds before slicing them out of the mix.
To improve audio separation evaluation, Meta has also introduced SAM Audio-Bench, a benchmark for measuring how well models handle speech, music, and sound effects. It is accompanied by SAM Audio Judge, which evaluates how natural and accurate the separated audio sounds to human listeners, even without reference tracks to compare against.
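Reference-free evaluation of the kind SAM Audio Judge performs can be pictured as scoring a separated track directly, with no ground-truth stem to compare against. The sketch below shows what such an interface could look like; the `sam_audio_judge` package and its API are assumptions, not Meta's published tooling.

```python
# Hypothetical sketch of reference-free scoring in the style of
# SAM Audio Judge; the package and API below are assumptions.
from sam_audio_judge import Judge  # hypothetical package

judge = Judge.from_pretrained("sam-audio-judge")  # hypothetical checkpoint

# Estimate how natural and accurate the separated audio sounds,
# given only the original mix and the prompt -- no reference track.
score = judge.score(
    mixture="podcast_raw.wav",
    separated="podcast_clean.wav",
    prompt="car traffic in the background",
)
print(f"Perceptual quality estimate: {score:.2f}")
```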
Meta claims these evaluations show SAM Audio performs best when different prompt types are combined and can handle audio faster than real-time, even at scale.
That said, the model has clear limitations. It does not support audio-based prompts, cannot perform full separation without any prompting, and struggles when similar sounds overlap, such as isolating a single voice from a choir.
Meta says it plans to improve these areas and is already exploring real-world applications, including accessibility work with hearing-aid makers and organizations supporting people with disabilities.
The launch of SAM Audio ties into Meta’s broader AI push. The company is improving voice clarity on its AI glasses for noisy environments, working toward next-generation mixed reality glasses expected to arrive in 2027, and developing a conversational AI that could rival ChatGPT, signaling a wider focus on AI models that understand sound, context, and interaction.
