AI sparks dramatic advances in voice technologies

Interview with Nicolas Obin

Nicolas Obin is both a lecturer at Sorbonne University and a researcher in the Sciences and Technologies of Music and Sound laboratory at IRCAM, where he has worked on voice synthesis for over a decade. A specialist in speech processing and human communication, he pursues research on applying the latest advances in artificial intelligence (AI) to voice and related technologies. He works both with experts from SCAI (the Sorbonne Center for Artificial Intelligence) and, as part of his artistic engagement at IRCAM, with renowned artists including Eric Rohmer, Philippe Parreno, Roman Polanski, Leos Carax, Georges Aperghis, and, this year, Alexander Schubert. We meet him on the occasion of the ManiFeste-2022 festival, during which he is organizing the "Deep Voice, Paris" meetings that he co-founded with Xavier Fresquet of SCAI.

In the age of AI, deep learning and vocal assistants, IRCAM is a pioneer in the creation of synthesized voices. Nicolas, can you tell us more about the research you do on a daily basis in the Sound Analysis and Synthesis team?

With vocal assistants, voice has become the favored modality of interaction between humans and the connected machines that populate our everyday lives. The voice makes it possible to bring the machine to life and to give it a semblance of humanity. My research focuses on the digital modeling of the human voice at the interface of linguistics, computer science, machine learning, and artificial intelligence. The goal is to better understand the human voice and communication in order to create speaking machines, to clone a person's vocal identity, or to manipulate personality attributes such as age, gender, attitude, or emotion.

Our team, led by Axel Roebel, has extensive scientific and technological expertise on the human voice, including the transfer of our research advances to professional audio plugins dedicated to the voice such as IrcamTools TRAX or IrcamLab TS, which are commonly used by sound designers in the film industry. For example, sound designer Nicolas Becker used the features in IrcamLab TS to recreate the sensation of progressive hearing loss in the film Sound of Metal, for which he won the Academy Award for Best Sound.

In addition to artistic collaborations, we are constantly working with brands and companies to give an artificial voice to personal assistants, virtual agents, or humanoid robots.

Sciences and Technologies of Music and Sound laboratory at IRCAM © Philippe Barbosa

You are the co-founder of "Deep Voice, Paris", an annual event dedicated to voice and artificial intelligence, which will be held this year from June 15 to 17. What will be the focus of this 2nd edition?

The theme of this second edition is diversity and inclusion in voice technologies, for a digital world that is better tailored to and more representative of the diversity of individuals, cultures, and languages. While there are between 6,000 and 7,000 living languages in the world today — including sign languages — only a few dozen, a hundred at best, are present in the digital world, whether in search engines, for translation, or as vocal assistants. The objective of "Deep Voice, Paris" is to bring together the actors of scientific research and technological innovation to imagine the uses and practices of the future, but also to reflect on the contribution of digital technology to the world of today and tomorrow.

We are looking forward to welcoming some of the leading innovators in these fields: in particular, the members of the Q project, creators of the first genderless artificial voice; Mozilla's Common Voice open science initiative; the companies Navas Lab Europe and ReadSpeaker, which specialize in multilingual speech synthesis and virtual agents; and the Californian startup SANAS, which can transform a person's accent in near real-time! The "Deep Voice, Paris" meetings provide an opportunity to keep abreast of technological developments, to meet the players, and to participate in discussions on their use in our daily lives.

With the ANR project TheVoice, you addressed the creation of voices for content production in the creative industry sector. Has this applied research consortium led to any significant achievements?

The ANR project TheVoice was an opportunity for us to work very closely with the creative industry, in particular production and post-production companies, especially for dubbing. It allowed us to better understand the voice professions and the industrial and cultural issues at stake, and to bring entirely new artificial intelligence solutions to a sector that is especially demanding in terms of quality.

More specifically, we designed algorithms that enabled us to transfer the vocal identity of one person onto the voice of another, in other words, a "vocal deep fake". We have already used these innovations — as part of a project conducted with IRCAM Amplify — for Thierry Ardisson's new TV show "Hôtel du Temps", in which deep fake technologies are used to give a digital life to personalities during an interview; to recreate the voice of Isaac Asimov, one of the founders of science fiction, in a documentary being produced by Arte; and to create artificial voices in Alexander Schubert's latest work Anima, just performed at the Centre Pompidou during ManiFeste.

And from the point of view of fundamental research, what are the challenges in creating the voices of the future, the next challenges for researchers to take up?

The boom in artificial intelligence in the mid-2010s spurred impressive advances in all areas of digital technology, and in particular in voice technologies. In 2018, the first artificial voices considered as natural as human voices were created, an achievement that crossed the threshold of a kind of "vocal singularity". This milestone prefigures the rapid and profound changes linked to the modeling and simulation of digital humans and to our modes of interaction with machines, in an ever-deepening immersion in digital technology. But despite these advances, the voice remains a complex manifestation of the human being: in contrast to 3D animation, which is widely used in movies, video games, and virtual reality, the fields of application of artificial voices are still very limited.

The upcoming research challenges are numerous: manipulating the attributes of a voice to create digital filters for sculpting the personality of a human or artificial voice; improving the modeling of the voice in interaction, in particular with its context, to allow fluid vocal interaction that is personalized and adapted to the interlocutor and the situation; all of this ultra-realistic, possibly guided by physics... and in real time!

This proliferation of research and innovation is a fabulous breeding ground for experimentation by artists. They can now produce voices and vocalizations that are virtually unheard of, that is to say, free from the constraints of nature and physics, whether to make a voice sing with a range unattainable by any human being or to create cyber-physical objects endowed with hybrid voices, such as making a tree, a lamp, or a guitar speak. These are endless possibilities for imagining new forms of expression and creative sound artifacts at the interface of the human and the machine, the singular and the universal, the real and the virtual.