Clones, Filters, and Fakes... Ethics and AI

The ability of the latest generations of AI to generate content with extremely realistic rendering opens up new possibilities for creation, while at the same time raising ethical and commercial concerns. With 40 years' experience of research into voice synthesis and transformation, IRCAM's researchers present a look back at these practices and reflect on the ethics of their use.

Hôtel du Temps TV show with Thierry Ardisson for France 3 (2022) © France TV, François Roelants

Like our body and face, our voice is among the most intimate things about us. It betrays our unconscious emotions, the undertones and unspoken words that the written word skillfully glosses over. On the face of it, the voice cannot lie. And yet, like our body and face, we can dress it up and make it up. The manipulation of intonation is as old as rhetoric, just as make-up and tattoos are as old as mankind, for whom self-transformation techniques are second nature. It is precisely because the voice is a faithful and transparent image of our inner selves that its transformation acquires a particular power. This is also why recent advances in digital voice processing, made possible by artificial intelligence, can seem so disturbing.

Speech conversion techniques make it possible to impart certain attributes of a target speaker's voice, such as age, gender, affective tonality and even vocal identity, to the speech of a source speaker. These technologies are already very familiar to researchers in IRCAM's Sound Analysis and Synthesis (A/S) team, who have been involved in their development and application in music research and the cultural industries for several decades. We speak of a "vocal filter" when the manipulation preserves the vocal identity of the source speaker, and of a "vocal clone" when the identity of the target speaker is substituted for that of the source, creating the illusion that one person has said what was in fact said by another.
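To make the filter/clone distinction concrete, here is a deliberately naive sketch of the simplest possible "vocal filter": a pitch shift by resampling, applied to a synthetic tone standing in for a voice. This is purely illustrative and assumes nothing about IRCAM's actual tools; real voice filters rely on phase vocoders or neural models that preserve duration, timbre and identity.

```python
import numpy as np

def pitch_shift(signal, semitones):
    """Crude pitch shift by resampling. Note that this also shortens
    or lengthens the signal; production filters avoid that side effect."""
    factor = 2 ** (semitones / 12)
    old_idx = np.arange(len(signal))
    new_idx = np.arange(0, len(signal), factor)
    return np.interp(new_idx, old_idx, signal)

def dominant_freq(signal, sr):
    """Frequency of the strongest spectral peak, via an FFT."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1 / sr)
    return freqs[np.argmax(spectrum)]

sr = 16000
t = np.arange(sr) / sr                # one second of audio
voice = np.sin(2 * np.pi * 220 * t)   # a 220 Hz tone standing in for a voice
higher = pitch_shift(voice, 12)       # raise it by one octave
print(dominant_freq(voice, sr), dominant_freq(higher, sr))  # 220.0 440.0
```

A real speech conversion system manipulates many such attributes at once (pitch, formants, timing); the ethical questions begin where these manipulations touch vocal identity itself.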

The notion of a fake further implies an intention to deceive, by presenting as an authentic recording what in reality results from the application of a filter or a clone. Deepfakes are fakes based on recent deep learning techniques, in which very large networks of computing units, called "neurons" because they are inspired by simplified models of biological neurons, learn complex tasks, such as recognizing a voice, directly from a large mass of examples (the training set).

"Thanks to deep networks, researchers generated a synthesized voice for the first time in 2017, one for which human listeners couldn't tell the difference between the real and the artificial. It's as if AI had succeeded in passing the Turing test for voice". 

Nicolas Obin, A/S team researcher 

As Axel Roebel, head of the A/S team, points out, "the price to be paid for these successes was the use of much larger databases and greater computing power".

Such new possibilities inevitably raise ethical issues. The creation of voice databases to train deep networks already raises a first series of questions. If the models learn from examples collected on the Internet, the representations of voice learned by these algorithms will be as biased as the data that feeds them. For example, if the data comes mainly from white, male, American speakers of English, the algorithms will build more reliable representations of voices speaking that variety of English than of the voices of African-American women, and a fortiori of speakers expressing themselves in languages for which little or no data has spontaneously been produced online. If documentary or creative applications of speech conversion technologies prove less effective for voices from speech cultures under-represented in the digital world, the use of these technologies threatens to accentuate these imbalances and generate what are known as algorithmic inequities. The European Linguatec AI project aims to develop neural architectures adapted to situations of limited computational and linguistic resources, so as to ensure better processing of less-resourced languages such as Aragonese, Catalan, Basque, and Occitan.
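The notion of algorithmic inequity can be illustrated with a toy computation: comparing a model's error rate across speaker groups. The numbers below are invented for illustration, not real evaluation results.

```python
# Invented, purely illustrative error rates of a hypothetical speech
# system trained mostly on US English data.
error_rate = {
    "en-US male":   0.05,  # heavily represented in the training set
    "en-US female": 0.08,
    "occitan":      0.35,  # barely represented
}

# Express each group's error rate relative to the best-served group:
best = min(error_rate.values())
disparity = {group: rate / best for group, rate in error_rate.items()}
for group, ratio in sorted(disparity.items(), key=lambda kv: kv[1]):
    print(f"{group}: {ratio:.1f}x the error rate of the best-served group")
```

Measured this way, inequity is simply the gap between the best- and worst-served groups, and it widens as the training data grows more skewed.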

Alongside the problems of algorithmic fairness that are consubstantial with deep learning techniques, there are issues specific to speech conversion technologies, starting with voice cloning. The key issue here is to determine in which application frameworks it is justifiable to confer the vocal identity of a target speaker on a sentence spoken by a source speaker. This question arises every time a member of the A/S team responds to a commission from the cultural industries, as was recently the case with the voice of Dalida, for Thierry Ardisson's program Hôtel du Temps, or of General de Gaulle reading the Appeal of June 18 for the newspaper Le Monde.

First of all, there is the notion of consent. Both source and target speakers must consent to the voice identity conversion operation before its result can be circulated. As obvious as this principle may seem, the availability of voice cloning applications requiring no particular IT expertise makes it very easy to violate, as shown by the recent example of David Guetta's use in concert of an unauthorized voice clone of the rapper Eminem. While these first instances of unauthorized cloning may still seem like tolerable media stunts, insofar as they are not commercially exploited, large-scale exploitation is in fact made possible by the current state of the technology and its widespread availability.

Singer Holly Herndon has decided to get ahead of her cloners by making a digital clone of her voice, called Holly+, available for creative use, on condition that uses are traceable and approved, and that a share of the revenues generated is paid back to her. The existence of such an "ethical" model for voice cloning does not, however, negate the temptation of unbridled use. For better or for worse, we can expect vocal cloning, whether consensual or not, to become widespread in music production, following a trajectory comparable to that of Auto-Tune in popular music. The prospect of a post-scarcity situation looms, in which every song could exist in a version sung by one's favorite performer.

"The challenge for creation is not so much to explore more and more possibilities as it is to produce something rare. And therefore, unlike techno-creativity, to freely choose what it won't do".

Frank Madlener, director of IRCAM

In some cases, it is simply impossible to obtain the cloned person's consent. For what appear to be deep-rooted anthropological reasons, a significant number of voice clonings involve "awakening" the voices of the deceased. How can the principle of consent be fulfilled in such cases? From a legal point of view, the solution lies in securing the consent of the deceased's rights holders. While French law does not recognize post-mortem image rights, case law does consider that broadcasting images of the deceased can cause harm to their heirs if the memory of, and respect due to, the deceased have been violated. But referring to the consent of rights holders does not solve all the problems. Deepfake technologies can also be used by ill-intentioned rights holders to polish an ancestor's bad reputation by attributing laudable, but fictitious, statements to them.

To rule out this type of usage, we can invoke a principle of authority: it is legitimate to have a text artificially delivered by a deceased person only if he or she is its author, i.e. if he or she said or wrote it during his or her lifetime. This was the case with Dalida's answers to Thierry Ardisson's questions. However, this limitation to the words and writings of the target speaker leaves plenty of latitude for speech conversion. The information contained in a text, i.e. a series of words, is always poorer than that contained in its oral enunciation. A speaker can considerably alter the information transmitted simply by varying his or her intonation. "Vocal cloning applied to a text presupposes an act of interpretation", as Nadia Guerouaou, a neuroscience researcher in IRCAM's Perception and Sound Design team, points out. To recreate the Appeal of June 18, a vocal interpretation of the speech was first recorded by actor François Morel. Scientists from the A/S team then modified the speaker's identity using a conversion model trained on various recordings of General de Gaulle from the 1940s. Axel Roebel calls this "interpreted reconstruction". Alongside the principle of authority, we must therefore add a principle of fidelity governing the interpretation that accompanies speech synthesis.

In some cases, the boundary between historical and creative interpretation is not clear-cut, meaning that the principle of fidelity has to be adapted accordingly. One example is the voice recording of Marilyn Monroe's diary in Philippe Parreno's film Marilyn (2012), also produced by the A/S team. As Nicolas Obin recalls, "the vast majority of the recordings of Marilyn used to train the statistical models come from her films, i.e. not from Marilyn herself but from the often exuberant Marilyn character, while the director was looking for the more personal, intimate Marilyn, of which we have no recorded trace. It took a great deal of experimentation and adaptation to achieve the intimacy we wanted." When the film was first screened, those close to Marilyn were overwhelmed to find her as they had known her.

However, the principles of authority and fidelity are not always enough to ensure the legitimacy of vocal cloning. This is one of the lessons to be learned from the uproar caused by the documentary Roadrunner (2021), which chronicles the life of chef and TV host Anthony Bourdain until his suicide in 2018. Director Morgan Neville made Bourdain himself the film's narrator, relying on archival recordings and, for a few sentences, on reconstructions by voice cloning. In the film's most controversial passage, Bourdain's voice is heard reading from one of his e-mails, in which he expresses his unhappiness: "Like me, you're successful. But are you happy?" These are Bourdain's own words, rendered with an appropriately affective tone. What caused the scandal on social networks was the lack of transparency regarding this vocal cloning, which Neville didn't see fit to disclose, leaving the impression that it was a recording. By contrast, this principle of transparency was central to Parreno's Marilyn, which tackled fairly similar themes, but where the process was clearly explained in the film's presentation text.

Surprisingly, this principle of transparency is not always necessary for the acceptability of certain uses of voice filters. Nadia Guerouaou has explored the therapeutic uses of filters that modulate the emotion expressed by the voice in real time, making it more joyful or more sad: "These filters may have applications in imaginal exposure therapy to the traumatic event for patients with post-traumatic stress disorder. The idea is to use real-time vocal feedback to modulate the emotional charge associated with the memory trace of the traumatic event, and thus transform it". In an experimental ethics study, she probed the social acceptability of this type of filter and observed that its use was widely accepted by subjects, even in cases where it remained hidden from the recipient of the filtered voice.

In light of these results, Nadia Guerouaou wonders "whether it makes any ethical difference to make these changes by computer rather than through human learning, such as voice coaching". Voice filters also enable effortless transformations that no amount of training could make possible: there are lows and highs beyond the reach of any vocal coaching. We can assume that for such uses, the principle of transparency remains relevant.

The conversion of voice attributes also raises specific ethical questions linked to the definition of the attributes under consideration. What does it mean to feminize a voice? There are average acoustic differences between men's and women's voices, due in part to physiological differences, but there is also a great deal of inter-individual variability within each category, as well as plasticity of the vocal apparatus on an individual scale. Often, feminizing a voice means bringing it closer to a socially constructed stereotype of a female voice. This is not to say that voice conversion algorithms are doomed to perpetuate sexist biases.

© Les Amours d’Astrée et de Céladon by Éric Rohmer (2007)

To digitally feminize Céladon's voice when he cross-dresses as a woman in Éric Rohmer's Les Amours d'Astrée et de Céladon, scientists from the A/S team applied intermediate values between the male and female averages, so as to convey the character's gender ambiguity acoustically, a choice that found a favorable echo among Le Monde's critics, who noted "a queer tone as pleasing as it is unexpected". However, the designers of voice conversion software for the cultural industries and the general public have a responsibility: if the voice conversion functions are limited to clicking on a "feminize" button, without the user having any control over the parameters that define this femininity, the application risks perpetuating the gender biases associated with the voice.
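The design choice described above, a continuous control rather than a binary "feminize" button, can be sketched in a few lines. The average values below are rough textbook figures used purely for illustration, and a real system would expose many more parameters than these two.

```python
# Approximate average acoustic parameters (illustrative values only):
MALE_AVG = {"f0_hz": 120.0, "formant_scale": 1.00}
FEMALE_AVG = {"f0_hz": 210.0, "formant_scale": 1.18}

def gender_targets(alpha):
    """Interpolate acoustic targets between the two averages.
    alpha = 0.0 -> male average, 1.0 -> female average; intermediate
    values yield deliberately ambiguous voices, as for Céladon."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return {key: (1 - alpha) * MALE_AVG[key] + alpha * FEMALE_AVG[key]
            for key in MALE_AVG}

print(gender_targets(0.5))  # halfway between the two averages
```

Exposing alpha to the user, instead of hard-coding it to 1.0, is precisely what keeps a stereotype from being baked into the interface.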

This brief overview of the ethics of voice conversion algorithms shows that computer science cannot ignore the human and social implications that accompany its technical advances. Whether reconstructing the sound of the past, creating new vocalities, or using the voice to heal, the choices made in terms of engineering and use, at IRCAM and elsewhere, continually redraw the contours of our vocal identities. More than ever, we need to think critically about these tools and their uses.

By Pierre Saint-Germier, philosopher (CNRS)

(1) On the question of algorithmic fairness, see Cathy O'Neil, Algorithmes. La bombe à retardement, Paris, Les Arènes, 2018.
(2) Guerouaou N., Vaiva G., Aucouturier J.-J. (2021), "The shallow of your smile: the ethics of expressive vocal deep-fakes", Phil. Trans. R. Soc. B, 377: 20210083. On this subject, see also Camille Pierre, "Voix trafiquées au cinéma, un rappel aux normes ? Le plug-in TRaX à l'œuvre dans Les Amours d'Astrée et de Céladon (Éric Rohmer, 2007) et Les Garçons sauvages (Bertrand Mandico, 2018)", Semen, 51, 2022, pp. 39-54.
(4) This article is based on interviews with Nadia Guerouaou, Frank Madlener, Nicolas Obin and Axel Roebel. Our warmest thanks to them all.