Developers and creators can use state-of-the-art conversational AI models for expressive speech synthesis to generate voices for characters, virtual assistants and personalized avatars.
AI has transformed synthetic speech from the monotone of robocalls and old-school GPS navigation systems into the polished voice assistants in our smartphones and smart speakers.
However, a gap remains between AI-synthesized speech and the human speech we hear in daily conversation and in the media. That's because people speak with complex rhythm, intonation and timbre that are difficult for AI to imitate.
But the gap is closing fast: NVIDIA researchers are building high-quality, controllable speech synthesis models and tools that capture the richness of human speech without audio artifacts. The researchers are presenting their latest projects this week at the Interspeech 2021 conference, which runs through September 3.
These models can help voice automated customer service lines for banks and retailers, bring video game and book characters to life, and provide real-time speech synthesis for digital avatars.
NVIDIA's own in-house creative team even used the technology to produce expressive narration for a video series about the power of AI.
Expressive speech synthesis is just one element of NVIDIA Research's work in conversational AI, a field that also spans natural language processing, automatic speech recognition, keyword detection, audio enhancement and more.
Optimized to run efficiently on NVIDIA GPUs, some of this cutting-edge work has been made open source through the NVIDIA NeMo toolkit and is available in containers from NVIDIA NGC and other software hubs.
Behind the scenes of I AM AI
NVIDIA researchers and professional creators don't just talk about conversational AI: they put breakthrough speech synthesis models to work in the I AM AI video series, which features global AI innovators reshaping every imaginable industry.
Until recently, these videos were narrated by humans. Earlier speech synthesis models offered limited control over a synthetic voice's pacing and pitch, so AI narration failed to evoke the emotional response in audiences that only expressive human voices could deliver.
That has changed over the past year, as NVIDIA's text-to-speech research team developed more powerful, controllable speech synthesis models such as RAD-TTS, which NVIDIA used in its award-winning demo at the SIGGRAPH Real-Time Live competition. By training the text-to-speech model on audio of an individual's speech, RAD-TTS can convert any text into that speaker's voice.
Another of the model's capabilities is voice conversion: delivering one speaker's words (or even singing) in another speaker's voice. Inspired by the idea of the human voice as a musical instrument, the RAD-TTS interface gives users fine-grained, frame-level control over the pitch, duration and energy of the synthesized voice.
With this interface, a video producer could record himself reading the script, then use the AI model to convert his speech from a male narrator's voice into a female narrator's voice. Using this baseline narration, the producer could then direct the AI like a voice actor, tweaking the synthesized speech to emphasize specific words and adjusting the pacing of the narration to better express the video's mood.
The AI model's capabilities extend beyond voiceover work: text-to-speech can be used in gaming, to aid individuals with vocal disabilities, or to help users narrate in a different language in their own voice. It can even recreate the performance of an iconic singer, matching not only the melody of a song but also the emotional expression behind the vocals.
Providing AI developers and researchers with powerful speech capabilities
NVIDIA NeMo is an open-source Python toolkit for GPU-accelerated conversational AI. With it, researchers, developers and creators gain a head start in experimenting with, and fine-tuning, speech models for their own applications.
NeMo's easy-to-use APIs and pretrained models help researchers develop and customize models for text-to-speech, natural language processing and real-time automatic speech recognition. Several of the models are trained on tens of thousands of hours of audio data on NVIDIA DGX systems. Developers can fine-tune any model for their use case and accelerate training with mixed-precision computing on NVIDIA Tensor Core GPUs.
Through NGC, NVIDIA NeMo also offers models trained on Mozilla Common Voice, a dataset with nearly 14,000 hours of crowdsourced speech data in 76 languages. Supported by NVIDIA, the project aims to democratize voice technology with the world's largest open-data voice dataset.
NVIDIA researchers present the latest advances in AI speech technology at Interspeech
Interspeech brings together more than 1,000 researchers showcasing breakthroughs in speech technology. At this week's conference, NVIDIA Research is presenting conversational AI model architectures as well as fully formatted speech datasets for developers.
Catch the following talks from NVIDIA speakers:
● Scene-Agnostic Multi-Microphone Speech Dereverberation – Tuesday, August 31
● SPGISpeech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Recognition – Wednesday, September 1
● Hi-Fi Multi-Speaker English TTS Dataset – Wednesday, September 1
● TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction – Thursday, September 2
● Compressing 1D Time-Channel Separable Convolutions Using Sparse Random Ternary Matrices – Friday, September 3
● NeMo Inverse Text Normalization: From Development to Production – Friday, September 3
You can find NeMo models in the NGC catalog and listen to the talks given by NVIDIA researchers at Interspeech.
NVIDIA shared a video on its expressive speech research at Interspeech: https://www.youtube.com/watch?v=RknIx6XmffA