
Seth Scott / Points of Articulation

The artistic residency blog

A composer, sound designer, professor of Electronic Music, and researcher at the Guildhall School of Music and Drama in London, Seth Scott has chosen the Sound Analysis-Synthesis team of the STMS lab at IRCAM for his artistic research residency, which explores the politics and poetics of synthesised speech.

It is an age-old dream: already in the time of the Pharaohs, Egyptians were making statues “talk”. Later, Homer described what looked like talking automatons, designed by the god Hephaestus, and from the 3rd century BC onwards, scholars such as Philon of Byzantium, Ctesibius and Hero of Alexandria imagined the first machines, which were soon given a voice. Seth Scott’s artistic research residency belongs to that long lineage. Scott is no stranger to mixed, interactive or unconventional projects, whether tied to the image or designed especially for exhibition spaces. For a few years now, he has been working on a project that has as much to do with the history of art and technology as with composition. Scott, in fact, has a long-standing interest in talking machines, and especially in their contemporary counterparts: the speech synthesis that the latest machine-learning models have made possible – often loosely referred to as “artificial intelligence”.

Photo credits 1 and 2: Performance by Seth Scott at the MO Museum in Vilnius © Tomas Terekas

“I dove deep into the history of human voice generation. When I started my research, I was amazed to discover how old the first attempts at speech synthesis actually are. They date back to antiquity, like the Colossi of Memnon, which some claimed produced the sound of a human voice. This interest in the human voice is also at the very core of my compositional work, which often deals with the notion of speech; I focus on the spoken word, on setting text to music… That is what led me to take on a commission for an organisation of the City of London: a sort of guided tour (involving an unreliable narrator) taking us from a meat market to the London Stock Exchange and exploring the rather obscure similarities between the two along the way.”

“What interests me is as much the technologies involved as the relationships we entertain with them, relationships that often carry more desires, fantasies and fears than truths – and this was the case well before the recent emergence of artificial intelligence. It shows just how much our relationship to knowledge as a whole has changed, as well as how we understand the physical phenomenon of the voice. In every age, in fact, talking machines have relied on how people believed speech worked at the time. The creation of the first automatons in the Baroque era coincided with the first scientific breakthroughs in anatomy, leading engineers to try to recreate the various organs that the doctors of the day imagined to be producing speech. In the 19th century, the science of acoustics emerged, and machines began to draw on this new understanding of the physical nature of sound: reproducing the voice became an attempt to grasp the very way it resonates. With machine learning, we return to what I would call anatomical imitation, even though it is no longer about modelling how the vocal cords and the larynx work, but how the brain does. In a way, we are reverting to the rather crude method of the Baroque era.”

That is when Scott heard about the call for applications for artistic research residencies at IRCAM: “I had already used tools developed at IRCAM, such as RAVE, a fantastic synthesis model based on deep learning, but I knew I was not up to date with their newest technological developments. Before even starting this residency, I had the opportunity to talk with Axel Roebel about our respective expectations, the challenges to overcome and the experimental dimension of the tools developed at IRCAM.” When he got to the studio, however, Scott was taken aback by just how refined these new tools were. He explains: “They aren’t all fully functional yet, but what they can accomplish is absolutely insane! Nine times out of ten, when I ask the machine to say something, it does it. I found that almost disappointing: the result was too close to a real voice.”
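RAVE, which Scott mentions in passing, is developed openly by IRCAM (the acids-ircam/RAVE project) and is typically driven from a few lines of Python. The sketch below is illustrative only: it assumes a downloaded TorchScript export named model.ts and an input file already at the model’s sample rate, and the latent perturbation is just one arbitrary way of “playing” with such a model.

```python
# Illustrative sketch: timbre transfer with a pretrained RAVE export.
# Assumes "model.ts" is a TorchScript export from the acids-ircam/RAVE
# project and that "input.wav" already matches the model's sample rate.
import torch
import torchaudio

model = torch.jit.load("model.ts").eval()       # pretrained RAVE export

wav, sr = torchaudio.load("input.wav")
x = wav.mean(dim=0, keepdim=True).unsqueeze(0)  # (batch, channels, time)

with torch.no_grad():
    z = model.encode(x)                  # compress audio into latents
    z = z + 0.5 * torch.randn_like(z)    # perturb the latent trajectory
    y = model.decode(z)                  # resynthesise through the decoder

torchaudio.save("output.wav", y.squeeze(0), sr)
```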

“Such high-quality results raised new questions, of a more philosophical or ethical nature. If the results came so close to what could be achieved with a real actor, was the tool still useful from a creative standpoint? All the more so since, like any technology, machine learning isn’t neutral: if anything, it carries political and ethical implications that are frankly worrying. I therefore needed a clear rationale to justify its use: it was inconceivable for me to use machine learning for something that could be done another way – one without any substantial ecological or social cost.” This challenge is all the harder to meet because, unlike other technologies, machine learning follows probabilistic laws, and therefore tends to generate a median result that is both boring and flat.

“We are all familiar with the “glitch” aesthetic that can be produced by hijacking a technical tool such as a filter or a plugin. For this work, I had to find ways to recreate those little “accidents” on purpose. My first impulse was to use a model on datasets it hadn’t been trained on. It is also interesting to play with the “probabilistic” dimension of the models. In the first version of my script, my prompt could only carry information about the spoken text, not about the tone of voice. That is how I discovered that the content of a text can actually determine the tone of the voice: for instance, if I make the machine say “Once upon a time”, chances are the generated voice will sound like a storyteller. The same goes for academic texts… This observation highlights just how biased datasets can be.”

“In the second version of my script, I was able to enter a specific text together with an audio sample to use as an example. This way, I could generate speech in which the text and the tone of voice were completely at odds, and obtain very interesting results, especially when my prompt didn’t match the specific characteristics of the dataset the machine had been trained on.”
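The tool Scott describes is an in-house IRCAM prototype, but the same idea – a text prompt paired with a reference audio sample whose tone may clash with the content – can be tried with openly available models. A minimal sketch using the open-source Coqui TTS library and its XTTS voice-cloning model (the reference file name is a placeholder):

```python
# Stand-in example only: not the IRCAM tool, but the same principle,
# using the open-source Coqui TTS library (pip install TTS).
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# A fairy-tale opening delivered in the tone of a placeholder reference
# clip (e.g. a dry newsreader), so that text and voice are at odds:
tts.tts_to_file(
    text="Once upon a time",
    speaker_wav="newsreader_sample.wav",  # placeholder reference audio
    language="en",
    file_path="mismatched_tone.wav",
)
```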

Seth Scott went so far as to torture the software, devising extreme prompts to test its limits or its ability to transcend gender identity, or deriving rhythms from sequences of text. The latest version of the model developed by the Sound Analysis-Synthesis team also makes it possible to play with the level of emotion detectable in the voice – but its configuration is still at the experimental stage.

His goal is eventually to use this work as a starting point for composing new music. “I started off by making sketches. I am not one of those composers who can create from nothing; I first need to generate some kind of material, often more than I need, and much more than I can realistically handle. I then manipulate this material to create something new. For the moment, I am focusing on small sections, because the software can only generate a maximum of 30 seconds of speech at a time – the team’s researchers are, of course, still working to improve this. The biggest issue is that the software needs to “remember” what it has already created in order to generate the rest, which means that the longer the material, the more complicated it gets. But to me, that is only a structural issue, one I can easily work around.”
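The workaround Scott alludes to can be pictured in a few lines. Everything in the sketch below is hypothetical – synthesise() and its context argument stand in for whatever interface the team’s model actually exposes – but it shows the structural issue: each chunk must be conditioned on the tail of the previous one.

```python
# Hypothetical sketch of chunked generation around a ~30-second limit.
# synthesise() and its `context` argument are stand-ins, not a real API.
import numpy as np

MAX_SECONDS = 30
SR = 22050            # assumed sample rate
CONTEXT_SECONDS = 3   # tail of previous audio fed back as "memory"

def synthesise(text: str, context: np.ndarray | None = None) -> np.ndarray:
    """Placeholder for the model call; returns up to MAX_SECONDS of audio.
    Here it simply returns silence proportional to the text length."""
    seconds = min(MAX_SECONDS, max(1, len(text) // 15))
    return np.zeros(seconds * SR, dtype=np.float32)

def generate_long(sentences: list[str]) -> np.ndarray:
    chunks, context = [], None
    for sentence in sentences:
        audio = synthesise(sentence, context=context)
        chunks.append(audio)
        # Keep only the last few seconds as conditioning for the next chunk:
        context = audio[-CONTEXT_SECONDS * SR:]
    return np.concatenate(chunks)
```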

Seth Scott is employing machine learning tools at three different stages of his creative process. First, to generate code:

“I sometimes use AI models (such as ChatGPT), for example to develop my own programs in Python. But the environment I use the most is Max, even though it isn’t designed for processing speech.”
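Since Max has no native speech-synthesis objects, one plausible shape for this kind of glue code – an assumption, not a description of Scott’s actual patches – is a Python script that does the speech processing and streams its results to Max over OSC. A minimal sketch using the python-osc package (the /speech/f0 address and port are invented):

```python
# Minimal sketch: streaming analysis data from Python to Max via OSC.
# Requires `pip install python-osc`; the /speech/f0 address is invented
# and would be matched by a [udpreceive 9000] object in the Max patch.
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 9000)  # Max listening on port 9000

f0_curve = [110.0, 112.5, 118.0, 121.3]      # e.g. pitch values from speech
for value in f0_curve:
    client.send_message("/speech/f0", value)
```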

Then, to generate semantic content: the tools he is developing can, for instance, collect words from the existing databases used to train specific models (such as those developed by the Sound Analysis-Synthesis team). The last stage, naturally, is generating sound. Each stage requires working simultaneously on the databases and on the writing of algorithmic scripts, which raises intellectual property issues.
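When a training corpus is public, harvesting its vocabulary takes only a few lines. As a hedged example, many speech-synthesis datasets – LJSpeech among them – ship a metadata.csv of pipe-separated transcriptions; the sketch below assumes that layout and would need a different parser for other corpora:

```python
# Sketch: collecting the vocabulary of a speech-synthesis training corpus.
# Assumes an LJSpeech-style metadata.csv ("id|raw text|normalised text").
import re
from collections import Counter

vocabulary = Counter()
with open("LJSpeech-1.1/metadata.csv", encoding="utf-8") as f:
    for line in f:
        transcription = line.rstrip("\n").split("|")[-1]  # last column
        vocabulary.update(re.findall(r"[a-z']+", transcription.lower()))

# The most corpus-typical words, raw material for new prompts:
for word, count in vocabulary.most_common(20):
    print(f"{word}\t{count}")
```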

“I like to play with databases. I even use some that were used to develop speech synthesis models in the past, before AI even existed.”

While all these processes, some of them very demanding, take up most of the time Scott devotes to his research and compositional work, another issue, probably even more essential, already arises: “The tools we are using are truly groundbreaking, cutting-edge technology, but all technology is doomed to expire. So how will the piece I am working on withstand the passage of time?”

Jérémie Szpirglas