Artificial intelligence has transformed the mono-choice speech synthesis of decades-old robocalls and GPS navigation systems into the refined virtual assistants’ synthesis integrated into smartphones and smart speakers, highlighted Nvidia.
The company itself has however also pointed out that there is still a gap between the speech synthesized by artificial intelligence and the human one we hear in daily conversations and in the media.
This is the case, Nvidia explains, because people speak with a rhythm, a timbre and complex, which are difficult to emulate for artificial intelligence.
However, the gap is rapidly shrinking according to the company. Nvidia researchers are in fact building models and tools for a high quality and controllable speech synthesis that captures the richness of human speech, without audio artifacts.
The latest projects by Nvidia researchers were shown in the sessions of the recent Interspeed 2021 conference.
These models • believes the company specializing in artificial intelligence • can help give voice to automated customer service lines for banks and retailers, to give life to characters of video games or books and to
Nvidia’s internal creative team uses this technology even to produce an expressive narrative for a series of videos about the power of artificial intelligence.
Until recently, these videos were narrated by a human being. The previous models of vocal synthesis • • • • • • • • • • • • • • •
This has changed over the last year, when Nvidia’s text-to-speech research team developed more powerful and controllable speech synthesis models like RAD-TTS, used in the winning demo at Real-Time competition
By training the text-to-speech model with the sound of an individual’s speech, RAD-TTS can convert any text message into the voice of the speaker.
Expressive speech synthesis is just one of the elements of NVIDIA Research’s work in the artificial conversational intelligence: a field that also includes the elaboration of natural language, automatic speech recognition, keyword detection, improvement of audio and more
Optimized to be efficiently executed on the company’s GPUs, part of this cutting-edge work has been made open source through Nvidia NeMo.
Thanks to Nvidia NeMo • an open source Python toolkit for the fast-paced artificial conversational intelligence of GPUs • researchers, developers and content creators gain an advantage in the experimentation and development of voice models
Easy to use APIs and pre-trained models in NeMo help researchers develop and customize templates for text-to-speed, natural language processing and automatic voice recognition in real time.