Facebook’s AI team has introduced a method that can separate up to five voices speaking simultaneously into a single microphone. The proposed method outperforms previous approaches on multiple speech source separation benchmarks, including challenging benchmarks with noise and reverberation. On the wsj0-2mix and wsj0-3mix datasets, as well as variants with four and five simultaneous speakers, the model improves the scale-invariant signal-to-noise ratio (SI-SNR, a common measure of separation quality) by more than 1.5 dB over the previous state-of-the-art models.
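For readers unfamiliar with the metric, SI-SNR projects the estimate onto the reference signal so that rescaling the estimate does not change the score. A minimal NumPy sketch of the standard definition (function and variable names are my own, not from the paper):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (toy sketch)."""
    est = est - est.mean()                 # remove DC offset
    ref = ref - ref.mean()
    # Project the estimate onto the reference: the scale-invariant target.
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target               # everything not explained by ref
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```

Because of the projection, `si_snr(2 * est, ref)` equals `si_snr(est, ref)`, which is what makes the measure robust to arbitrary output gains.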
To build the model, the team used a new recurrent neural network architecture that operates directly on the raw audio waveform. Previously, the best models mainly relied on a mask-and-decoder pipeline to isolate each speaker’s voice; the performance of such models degrades rapidly when the number of speakers is large or unknown.
Like standard voice separation systems, the Facebook AI team’s model requires prior knowledge of the total number of speakers. To handle the challenge of an unknown number of speakers, the researchers built a new system that automatically detects how many speakers are present and selects the most appropriate model.
1. Working principle
Given a mixed input speech signal, the main goal of the speech separation model is to estimate the individual source signals and produce a separate output channel for each speaker.
The model uses an encoder network that maps the input signal to a latent representation. The team then applies a speech separation network composed of multiple blocks, whose input is this latent representation and whose output is an estimated signal for each speaker. Previous methods usually perform separation with masks, which becomes problematic when a mask is undefined, and signal information may be lost during processing.
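The encoder/blocks/output structure described above can be sketched in a few lines. This is a deliberately tiny NumPy stand-in, not the paper’s architecture: the real model uses learned recurrent blocks over time frames, whereas here random matrices play the role of trained weights, purely to show the data flow (latent representation in, one waveform per speaker out, no masks):

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, S, B = 160, 32, 2, 3   # samples, latent dim, speakers, blocks

# Toy parameters standing in for learned weights (all hypothetical).
W_enc = 0.1 * rng.standard_normal((L, T))                    # encoder
W_blocks = [0.1 * rng.standard_normal((L, L)) for _ in range(B)]
W_out = 0.1 * rng.standard_normal((S * T, L))                # per-speaker output

def separate(mix):
    h = np.tanh(W_enc @ mix)          # encoder: waveform -> latent representation
    for W in W_blocks:                # stack of separation blocks
        h = h + np.tanh(W @ h)        # residual refinement of the latent
    return (W_out @ h).reshape(S, T)  # direct waveform estimates, no masking step
```

The key point the sketch illustrates is the last line: the network emits the estimated waveforms directly rather than multiplying the mixture by a mask.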
The researchers train the model with permutation invariant training and use loss functions that directly optimize the signal-to-noise ratio. The team further improved the optimization process by inserting a loss function after each separation block. Finally, to ensure that each speaker is consistently mapped to a specific output channel, Facebook adds a perceptual loss function based on a pretrained speaker recognition model.
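Permutation invariant training exists because the network’s output channels have no inherent order: output 1 matching speaker 2 and vice versa is just as correct. The standard trick is to evaluate the loss under every possible assignment of outputs to targets and backpropagate only the best one. A minimal sketch (the `loss_fn` argument and return format are my own choices; the brute-force search scales factorially, which is fine for up to five speakers):

```python
import itertools
import numpy as np

def pit_loss(estimates, targets, loss_fn):
    """Return (lowest total loss, best permutation) over all
    assignments of estimated channels to target speakers."""
    n = len(targets)
    best = None
    for perm in itertools.permutations(range(n)):
        # perm[i] = index of the estimate assigned to target i
        total = sum(loss_fn(estimates[p], targets[i]) for i, p in enumerate(perm))
        if best is None or total < best[0]:
            best = (total, perm)
    return best
```

In the paper’s setting `loss_fn` would be a negated SI-SNR; any per-pair loss works with the same search.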
The team also built a new system to handle separation when the number of speakers is unknown. For this system, separate models are trained for two, three, four, and five speakers. The researchers feed the mixed input to the model designed for up to five simultaneous speakers and detect the number of currently active (non-silent) output channels. Facebook then repeats the process with the model trained for that number of speakers and checks whether all of its output channels are active, continuing until all channels are active or the model with the fewest target speakers is reached.
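The selection procedure above can be summarized as a short loop. This is an illustrative sketch, not Facebook’s code: the `models` dict of per-speaker-count separators, the energy-based activity test, and the threshold value are all assumptions standing in for the trained models and the paper’s activity detector:

```python
import numpy as np

def channel_energy(ch):
    return float(np.mean(np.asarray(ch) ** 2))

def select_separation(mixture, models, silence_thresh=1e-4):
    """models: dict mapping speaker count -> separation function.
    Start from the largest model and step down until every output
    channel is active or the smallest model is reached."""
    n = max(models)                       # e.g. the 5-speaker model
    while True:
        outputs = models[n](mixture)
        active = sum(channel_energy(o) > silence_thresh for o in outputs)
        if active == n or n == min(models):
            return outputs
        # Retry with the model sized to the detected speaker count.
        n = max(active, min(models))
```

For a two-speaker mixture, the five-speaker model would typically emit two energetic channels and three near-silent ones, so the loop falls through to the two-speaker model on the second pass.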
The ability to separate a single voice from a multi-person conversation can improve our daily communication through a variety of applications, such as voice messaging, digital assistants, and video tools, and can enable new AR/VR voice interactions. It can also improve the experience of people who rely on hearing aids, letting them hear others’ voices more clearly in crowded, noisy environments such as parties or restaurants.
Beyond separating voices, the new system can also be applied to separating other types of signals from a mixture, such as background noise. The research could also be applied to music recordings, improving on previous work that separates individual instruments from a single audio file. Facebook said the next step will be to improve the model’s generalization until it achieves high performance in real-world conditions.