Scientists at the University of California, San Francisco have developed an efficient speech synthesizer driven by deep learning algorithms, which is expected to let people who have lost their voice to illness “speak” at a near-normal speed.
Stroke, cerebral palsy, amyotrophic lateral sclerosis (ALS) and other conditions can rob patients of the ability to speak. Some devices currently on the market track the movement of a patient’s eyes or facial muscles to spell out, word by word, what the patient wants to say, then use a speech synthesizer to “say” those words. The most famous example is the late British physicist Stephen Hawking, who had ALS; in his later years he controlled his voice synthesizer with a single cheek muscle.
However, this way of communicating is extremely inefficient: it generally does not exceed 10 words per minute, while normal speech runs at about 150 words per minute.
Instead of typing first and then reading aloud, the scientists at the University of California, San Francisco are trying to use algorithms to learn the relationship between brain signals and vocal tract movements. Once that relationship is established, signals in the brain can be converted into the corresponding vocal tract movements, which in turn produce sound.
To achieve this, the researchers recruited five volunteers with epilepsy. The volunteers could speak normally and had electrodes temporarily implanted in their brains to locate the source of their seizures before surgery. This gave the researchers a chance to monitor the activity of the brain’s speech centers while the volunteers spoke.
The researchers asked the volunteers to read given sentences aloud and recorded the activity of the brain’s speech centers as they read. They then paired these recordings of brain activity with previously collected vocal tract motion data.
The researchers used these data to train a set of deep learning algorithms, then integrated the algorithms into a decoder. The device first converts brain signals into vocal tract movements, and then converts those movements into synthetic speech.
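The two-stage design described above can be sketched in code. The following is a minimal illustration only, not the authors’ actual model: the real system used recurrent neural networks trained on electrocorticography data, whereas here the two stages are stand-in random linear maps, and all dimensions and names are hypothetical.

```python
import numpy as np

# Hypothetical two-stage decoder sketch (NOT the published model):
#   stage 1: neural activity  -> vocal-tract kinematics
#   stage 2: kinematics       -> acoustic speech features
rng = np.random.default_rng(0)

N_NEURAL, N_KINEMATIC, N_ACOUSTIC = 256, 33, 32  # assumed sizes

# Stand-ins for the two trained networks; here just random linear maps.
W_stage1 = rng.standard_normal((N_NEURAL, N_KINEMATIC))
W_stage2 = rng.standard_normal((N_KINEMATIC, N_ACOUSTIC))

def decode(neural_frames: np.ndarray) -> np.ndarray:
    """Brain signals -> vocal-tract movement -> acoustic features."""
    kinematics = neural_frames @ W_stage1   # stage 1
    acoustics = kinematics @ W_stage2       # stage 2
    return acoustics

# 100 time frames of simulated cortical activity in, speech features out.
speech_features = decode(rng.standard_normal((100, N_NEURAL)))
print(speech_features.shape)  # (100, 32)
```

The point of the intermediate kinematic stage is that vocal tract movement is a more stable, lower-dimensional target than raw audio, which is what makes the decoded speech easier to understand, as the next paragraphs explain.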
Stephanie Riès, a neuroscientist at San Diego State University who was not involved in the study, said that speech generated by mapping brain activity to vocal tract movement and then converting that movement into sound is easier to understand than speech generated by mapping brain activity directly to sound.
“In fact, few of us really know what is happening in our mouths when we speak,” said Edward Chang, a neurosurgeon and corresponding author of the paper. “The brain converts what you want to say into vocal tract movement, and that’s what we’re trying to decode.” Chang said that listeners who heard the synthesized sentences could understand an average of 70% of the words.
Scientists have previously used artificial intelligence to decode brain activity into single words, but mostly simple monosyllables. “Jumping from monosyllables to sentences is technically quite challenging, and it’s one of the most impressive aspects of this study,” commented Chethan Pandarinath, a neuroengineer at Emory University who was not involved in the study.
“When we first heard the results, we were shocked – we couldn’t believe our ears. It was incredible how much of real speech was present in the synthesizer’s output. Of course, much remains to be done to make the speech more natural and intelligible, but we were impressed by how much of it could be decoded from brain activity,” said Josh Chartier, a doctoral student at the University of California, San Francisco and co-author of the paper.
“We hope these findings will give hope to people whose speech is impaired, and that one day we will be able to restore the ability to communicate, which is one of the foundations of our humanity,” he added.