
The Call by Holly Herndon & Mat Dryhurst – Towards an Ethical and Open-Sourced AI 2/2

Focus on the Scientific Collaboration with Nils Demerlé

On exhibit from October 4, 2024 to February 2, 2025 at the Serpentine North Gallery in London, as part of a collaboration between artists Holly Herndon and Mat Dryhurst and Serpentine Arts Technologies, The Call is a spatial audio installation built in two symmetrical parts, each containing a mic and separated by a curtain. This interactive installation allows visitors to generate English sacred choral music by singing their own songs, whether improvised or not. Built on deep-learning technologies and drawing on a vast dataset gathered especially for the project, The Call also invites reflection on the use of AI and its impact on our society.

In this second and final instalment of a two-part article, Nils Demerlé, a PhD student in the ACIDS (Artificial Creative Intelligence and Data Science) project team – supervised by researcher Philippe Esling in the Sound Analysis-Synthesis team – shares his views on the project, for which he developed an AI model to generate choirs.

How did this artistic project come within the scope of your research?

Nils Demerlé: As it happens, when Holly and Mat came to see us at IRCAM, I already had some experience with generating choir music. The ACIDS team regularly organises concerts to showcase what the models we are developing are capable of, and some of these featured choral music. The model I was working on already had a vocal timbre transfer feature – or, as it is more often called in the much more complicated case of choirs, “style transfer”. Originally, this model could only project one choral sound onto another, and only offline. Holly and Mat, however, quickly asked me to create another model, one that required only a monophonic input and that could work interactively.

The model we have now generates an artificial choral song from two incoming audio streams: one that defines the timbre and harmony of the target (an audio segment of 5-10 seconds – in our case, the sound of the choirs that serves as the sound environment of the installation), and a second from which the pitch, rhythm and lyrics are extracted (in our case, the voice of the participant). All of this happens in real time, with a maximum latency of approximately 300 ms. This latency forced us to adopt a call-and-response structure very similar to traditional choral singing. This way, when a participant sings into the mic, their song is transformed and played back almost simultaneously, in the same tone as the choirs that make up the sound environment.
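The interview stays at this level of description, but for readers curious what such a two-stream setup can look like, here is a minimal, purely illustrative PyTorch sketch: the module names, layer sizes and feature handling are assumptions made for the example, not the project's actual RAVE-based architecture.

```python
# Purely illustrative two-stream sketch (hypothetical names and sizes);
# the actual model in The Call builds on RAVE and differs from this toy.
import torch
import torch.nn as nn

SR = 44100               # assumed sample rate
BLOCK = 2048             # streaming block size (~46 ms at 44.1 kHz)

class TargetEncoder(nn.Module):
    """Summarises a 5-10 s choir excerpt into a timbre/harmony embedding."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, 256, stride=128), nn.ReLU(),
            nn.Conv1d(32, dim, 16, stride=8), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # one global embedding per excerpt
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, 1, samples)
        return self.net(x).squeeze(-1)                    # (B, dim)

class VoiceEncoder(nn.Module):
    """Encodes a live monophonic voice block into frame-level features."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, 64, stride=32, padding=16), nn.ReLU(),
            nn.Conv1d(32, dim, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)                                # (B, dim, frames)

class ChoirDecoder(nn.Module):
    """Renders choral audio conditioned on both streams."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(2 * dim, 32, 64, stride=32, padding=16), nn.ReLU(),
            nn.Conv1d(32, 1, 3, padding=1), nn.Tanh(),
        )

    def forward(self, voice_feats, target_emb):
        cond = target_emb.unsqueeze(-1).expand(-1, -1, voice_feats.shape[-1])
        return self.net(torch.cat([voice_feats, cond], dim=1))

# Toy streaming loop: the target embedding is computed once from the choir
# excerpt, then each incoming mic block is transformed on the fly.
target_enc, voice_enc, decoder = TargetEncoder(), VoiceEncoder(), ChoirDecoder()
choir_excerpt = torch.randn(1, 1, SR * 8)    # stand-in for an 8 s choir segment
mic_input = torch.randn(1, 1, SR)            # stand-in for 1 s of mic audio

with torch.no_grad():
    target_emb = target_enc(choir_excerpt)
    out_blocks = []
    for start in range(0, mic_input.shape[-1] - BLOCK + 1, BLOCK):
        block = mic_input[..., start:start + BLOCK]
        out_blocks.append(decoder(voice_enc(block), target_emb))
    generated = torch.cat(out_blocks, dim=-1)  # choral audio for the input
```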

The final patch, which runs in the Max software using the nn~ plug-in, owes a lot to RAVE (Realtime Audio Variational autoEncoder), a tool developed by Antoine Caillon, a researcher in the ACIDS team.
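As an aside for the technically minded: nn~ loads models exported as TorchScript files, so the bridge from a trained PyTorch model to a Max patch essentially comes down to scripting and saving the model. The toy sketch below illustrates only that step; the real RAVE/nn~ export tooling adds streaming-specific metadata and is more involved.

```python
# Minimal, hypothetical export sketch: a trained PyTorch model is scripted
# to TorchScript and saved so an environment like nn~ in Max can load it.
# ToyModel is a stand-in, not the actual architecture used in The Call.
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, samples) audio block coming from the patch
        return torch.tanh(self.conv(x))

model = ToyModel().eval()
scripted = torch.jit.script(model)     # TorchScript runs without the Python interpreter
scripted.save("toy_transfer.ts")       # file a Max patch could then point nn~ at
```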

What challenges did you face in the development of this project?

N.D.: Our main difficulty concerned the data itself, but this was also the most interesting part for us. Holly and Mat recorded about fifteen choirs across the United Kingdom. For each one, we got the individual recordings from each singer's mic as well as the recording of the full choir. This kind of data is absolutely ideal for training the model to produce a choral sound from a solo voice since, in a choir, every individual voice blends with the rest of the ensemble. We also had, depending on the songs, at least ten or fifteen takes. However, not all choirs had recorded every song, and because we needed a lot of data, we also ended up using recordings of their warm-up sessions and various exercises.
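To give an idea of how such paired material might be organised for training, here is a hypothetical sketch of a dataset that pairs each singer's close-mic take with the corresponding full-choir mix; the directory layout and file naming are invented for the example and are not the project's actual conventions.

```python
# Hypothetical layout: root/<choir>/solos/<song>_singerNN.wav paired with
# root/<choir>/mix/<song>.wav. Naming and structure are assumptions.
from pathlib import Path

import torch
import torchaudio
from torch.utils.data import Dataset

class SoloToChoirDataset(Dataset):
    """Pairs each singer's close-mic take with the matching full-choir mix."""

    def __init__(self, root: str, segment_seconds: float = 8.0, sr: int = 44100):
        self.seg = int(segment_seconds * sr)
        self.pairs = []
        for solo in Path(root).glob("*/solos/*.wav"):
            song = solo.stem.split("_singer")[0]
            mix = solo.parents[1] / "mix" / f"{song}.wav"
            if mix.exists():
                self.pairs.append((solo, mix))

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        solo_path, mix_path = self.pairs[idx]
        solo, _ = torchaudio.load(str(solo_path))
        mix, _ = torchaudio.load(str(mix_path))
        # Trim both signals to the same fixed-length segment for batching.
        n = min(solo.shape[-1], mix.shape[-1], self.seg)
        return solo[..., :n], mix[..., :n]
```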

Moreover, because they are amateur ensembles, the choirs inherently have their flaws. Some singers do not always sing in tune, and the quality of the recordings was not always ideal. Some questions naturally came up: “Do we need to keep everything? Should we correct the pitch (which is very easy to do with autotune)? If we do not correct these voices, the pitch of the generated song might suffer – and if so, do we need to correct it afterwards?” In the end, we chose to set aside approximately 30% of the recordings.

Having to generate a choral song from a solo voice posed another challenge. The dataset contains the individual voices of the singers as well as the full choir, but because some of them are middle voices, they are not necessarily representative of the choir as a whole. Also, we could hear the rest of the choir in the background of every solo recording – which meant that the model would get used to always hearing the choir behind each individual voice. To overcome this, we had to “cheat” a little by adding a harmonizer to the voice of the participant, which greatly helped in generating full choirs. Of course, in our work as researchers, this is not something we could ever have done – our articles would never get through peer review! But we have much more freedom when it comes to artistic projects.
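For illustration, a harmonizer of the kind mentioned here can be approximated by mixing the dry voice with pitch-shifted copies of itself; the intervals, mixing and normalisation in the sketch below are assumptions, not the settings used in The Call.

```python
# Hypothetical harmonizer: thicken a solo voice by layering pitch-shifted
# copies. The chosen intervals (octave down, fourth down, major third and
# fifth up) are illustrative assumptions.
import numpy as np
import librosa

def harmonize(voice: np.ndarray, sr: int, intervals=(-12, -5, 4, 7)) -> np.ndarray:
    """Mix the dry voice with pitch-shifted copies (intervals in semitones)."""
    layers = [voice]
    for semitones in intervals:
        layers.append(librosa.effects.pitch_shift(voice, sr=sr, n_steps=semitones))
    mix = np.mean(np.stack(layers), axis=0)
    return mix / (np.max(np.abs(mix)) + 1e-9)   # simple peak normalisation

# Example on a synthetic stand-in signal (2 s sine tone at 220 Hz).
sr = 44100
t = np.arange(sr * 2) / sr
voice = 0.1 * np.sin(2 * np.pi * 220.0 * t)
thickened = harmonize(voice, sr)
```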

As a researcher working with AI, what do you think makes Holly and Mat’s approach stand out from the rest?

N.D.: For me, it is the fact that they collected the dataset themselves in order to train their model and showcase it. They have been working with AI tools for several years now, and each time they are truly committed to honouring the artists behind the models. This process is the only way to faithfully reproduce the unique sound identity of English choirs.

Is there such a thing as a developer’s ethic?

N.D.: It is something I think about a lot. Some very important questions come up often in our field of research: “Do I really want to generate music in a way that poses a great threat to the whole music industry? How do we compensate the artists who produced all this data? Is ‘data’ even the right term to use in this case?” I would say that this project rather confirmed my views on the matter. I stopped training models on data I do not own the rights to a while ago, and I do not plan on going back. As for this project, I do not believe that the tools we are developing raise the same issues as other generative models like ChatGPT. I see them more as instruments at artists’ disposal than as automated music generators.

Finally, and this is something I am happy to announce, the method and code I developed to train the model on the dataset gathered by Holly and Mat will be published online and made freely available to everyone.

Interview conducted by Jérémie Szpirglas

Photo 1: Nils Demerlé © IRCAM-Centre Pompidou, photo: Déborah Lopatin
Photo 3: David Genova and Nils Demerlé, PhD students © IRCAM-Centre Pompidou, photo: Déborah Lopatin