Judith Deschamps 2/4: AI at all levels

Artistic Residencies: The Blog

In the Sound Analysis & Synthesis team, Judith Deschamps is drawing on the combined talents of two researchers, research director Axel Roebel and doctoral student Frederik Bous, to recreate a realistic castrato voice. Here is a brief overview of their work in progress.

Nearly 30 years ago, three members of the Analysis/Synthesis team, Philippe Depalle, Guillermo Garcia and Xavier Rodet, had already succeeded in creating a 'Virtual Castrato' for Gérard Corbiau's film Farinelli, which they described in a very detailed article. They had recorded two singers (with orchestral accompaniment), a countertenor and a coloratura soprano. The principle of the synthesis was quite simple, but it was hard work: each voice corresponds to a register, and the main challenge was to keep a coherent timbre throughout the castrato's range, particularly in the middle register, where the line passes from one voice to the other. They therefore amassed a large database of notes sung by both singers, covering all vowels and three levels of intensity, which they analysed to identify the spectral characteristics corresponding to the timbre. Considering that the castrato's timbre was probably closer to that of the countertenor, they had the soprano sing the notes that were too high for him, then "adapted" those notes by manually correcting their timbre.

"When you know how they did it," admits Axel Roebel, "the result is all the more impressive! The only real problem they found with this device, apart from its tediousness, was the feeling of sometimes hearing dynamic discontinuities in the sound."

In this new project, the aim is to 'augment' a voice by 'hybridising' it. For passages that fall within its own range, this voice is used as is. For pitches outside its range, the idea is to record the passages transposed to a singable pitch, then shift them back to the intended pitch and transform them by borrowing the timbre of other voices. Six voices were therefore recorded singing the aria "Quell'usignolo che innamorato" by Geminiano Giacomelli (1692-1740), which Judith Deschamps wants to reconstruct: two children's voices, a soprano, an alto, a countertenor and a leggero tenor. Of these six voices, the alto has been retained as the main voice, with the others being used to fill in the gaps in its range, using neural networks developed by Axel Roebel and doctoral student Frederik Bous.

"The basic principle," says Axel Roebel, "is to make a deep model learn to reconstitute the timbre of a singer from a given signal and a target pitch, which will then make it possible to transpose the passages that the selected voice cannot reach. The idea seems obvious, but its implementation is much more complex."

To communicate the properties of a given voice to the networks, the researchers use a condensed representation of that voice, sampled in time and laid out like an image, called a 'Mel-scale spectrogram' or 'Mel-spectrogram'. The Mel-spectrogram is easier for neural networks to manipulate because many details of the spectrum that are not relevant to our perception have been removed.
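To give a rough idea of why a Mel-spectrogram drops detail that matters little to the ear, the sketch below spaces a few frequency bands evenly on the Mel scale and shows that they get wider as frequency rises. It is only an illustration: the conversion formula is one common convention (the HTK-style one), and the band count, frequency range and function names are assumptions, not the settings used by the team.

```python
import math

def hz_to_mel(freq_hz: float) -> float:
    """Convert Hz to mels with the HTK-style formula (one common convention)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centres(n_bands: int, f_min: float, f_max: float) -> list:
    """Centre frequencies (Hz) of n_bands bands equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]

# Hypothetical settings, chosen only for readability.
centres = mel_band_centres(n_bands=8, f_min=0.0, f_max=8000.0)
spacings = [b - a for a, b in zip(centres, centres[1:])]

for centre in centres:
    print(f"{centre:8.1f} Hz")
```

The spacing between neighbouring bands grows with frequency, so high-frequency detail is summarised much more coarsely than low-frequency detail.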

Judith Deschamps in the IRCAM studios

First step: "Using the voice recordings, a neural network was trained to recreate a sound from its Mel-spectrogram," explains Axel Roebel. "By comparing the result with the original recording, we measure the 'loss' of quality linked to the process, which then allows the neural network to improve itself."
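The loop of producing an output, comparing it with the reference, measuring a loss and correcting the model can be shown on a deliberately tiny example. The sketch below is not the team's network: it is a hypothetical one-parameter model trained by plain gradient descent on made-up numbers, with a mean squared error standing in for whatever loss the researchers actually use.

```python
def mean_squared_error(predictions, targets):
    """Average squared difference between the model output and the reference."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Made-up data: the "reference" is simply the input scaled by 0.5.
inputs = [0.1 * i for i in range(1, 11)]
targets = [0.5 * x for x in inputs]

weight = 0.0          # the single trainable parameter of the toy model
learning_rate = 0.5   # assumed step size
losses = []

for step in range(50):
    predictions = [weight * x for x in inputs]
    losses.append(mean_squared_error(predictions, targets))
    # Gradient of the mean squared error with respect to the weight.
    gradient = sum(
        2.0 * (p - t) * x for p, t, x in zip(predictions, targets, inputs)
    ) / len(inputs)
    weight -= learning_rate * gradient

print(f"first loss {losses[0]:.6f}, last loss {losses[-1]:.2e}, weight {weight:.4f}")
```

Each pass reduces the loss a little, which is the sense in which the network "improves itself" by being compared with the original.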

This principle was then reproduced, but with a pair of networks called an 'autoencoder'. The aim of the first neural network is to produce not a 'complete' Mel-spectrogram, but a reduced form that represents only the spectral content unrelated to the fundamental frequency of the sound, i.e. the sung pitch. It is said to 'disentangle' the pitch from the rest of the information in the Mel-spectrogram. Again, this sounds obvious, but in reality it is not, since the timbre of a voice depends (also) on the pitch sung! This gives us what Frederik Bous calls the 'residual code' of the sound, which encodes the timbre, the phoneme, the vibrato, etc.

This 'residual code' (or rather these residual codes, as the process is reproduced for all the sounds in the database) is used to train the second neural network: the latter's job is to reconstitute a complete Mel-spectrogram from this residual code and a given fundamental frequency.

The first phase of the joint training of these two networks is done by feeding back into the second one exactly the same fundamental frequency as that of the original sample. Here again, this allows the result of the work to be compared with the original sound, giving the networks the possibility of learning from their mistakes and improving themselves (on the one hand to refine the production of the residual code, and on the other hand to adjust the recreation of the sound).
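The data flow between the two halves of the autoencoder can be mimicked without any neural network. The sketch below is a hand-written stand-in, not the researchers' model: the 'encoder' resamples the amplitudes of a note's harmonics onto a fixed frequency grid (a crude pitch-independent 'residual code'), and the 'decoder' reads that grid back at the harmonics of whatever fundamental frequency it is given. The grid size, the amplitudes and all function names are assumptions made for illustration.

```python
GRID_STEP_HZ = 100.0  # assumed resolution of the toy residual code
GRID_SIZE = 41        # the grid covers 0 Hz to 4000 Hz

def interpolate(x, xs, ys):
    """Piecewise-linear interpolation of the points (xs, ys) at x, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])

def encode(harmonic_amps, f0_hz):
    """Toy encoder: turn harmonic amplitudes into a fixed-size spectral envelope."""
    harmonic_freqs = [f0_hz * (k + 1) for k in range(len(harmonic_amps))]
    return [
        interpolate(i * GRID_STEP_HZ, harmonic_freqs, harmonic_amps)
        for i in range(GRID_SIZE)
    ]

def decode(residual_code, f0_hz, n_harmonics):
    """Toy decoder: rebuild harmonic amplitudes from the envelope and a given f0."""
    grid_freqs = [i * GRID_STEP_HZ for i in range(len(residual_code))]
    return [
        interpolate(f0_hz * (k + 1), grid_freqs, residual_code)
        for k in range(n_harmonics)
    ]

original = [1.0, 0.8, 0.5, 0.3, 0.2]           # made-up amplitudes of a 200 Hz note
code = encode(original, f0_hz=200.0)

rebuilt = decode(code, f0_hz=200.0, n_harmonics=5)     # same f0: can be checked
transposed = decode(code, f0_hz=300.0, n_harmonics=3)  # new f0: nothing to check against

reconstruction_error = max(abs(a - b) for a, b in zip(original, rebuilt))
print(reconstruction_error, transposed)
```

With the original fundamental frequency the output can be compared with the input, which is what makes this first training phase possible. With any other frequency the decoder still produces something, but there is no recording to measure it against.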

By design, the autoencoder already allows transpositions to be made (by injecting a fundamental frequency different from the original), even though it has not yet been trained to do so. But the more we change this frequency, the more the system has to modify the Mel-spectrogram at the output.

"However, modifying also means inventing! And in order to improve these 'inventions' of the system, we can no longer compare them with an existing signal, because no singer is capable of singing the same melody transposed into another range. We therefore lack a 'target' for the system. For this purpose, we will use a new network, which we have trained on the untransformed voices. It is up to this new network, called the 'discriminator', to 'criticise' the inventions of our transposition tool by comparing them with what it knows, and to determine whether they are plausible or not."

Again, the two neural networks are used to train each other: the transposer 'transposes' and the discriminator 'criticises', trying to guess whether or not the sounds produced are the result of a transposition by the former, and if so, to what extent. As a result, the first produces more and more plausible sounds, and the second becomes more and more precise in its criticisms: a win-win process.
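The mutual training of transposer and critic can be summarised by the two quantities each side tries to reduce. The article does not say which adversarial objective the researchers use, so the sketch below assumes the textbook binary cross-entropy formulation, with made-up critic scores between 0 (judged transposed) and 1 (judged untouched).

```python
import math

def binary_cross_entropy(probability: float, label: int) -> float:
    """Penalty for predicting `probability` when the true label is 0 or 1."""
    eps = 1e-12  # keeps log() away from zero
    p = min(max(probability, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def discriminator_loss(score_real: float, score_transposed: float) -> float:
    """Small when the critic rates real voices near 1 and transposed ones near 0."""
    return binary_cross_entropy(score_real, 1) + binary_cross_entropy(score_transposed, 0)

def transposer_loss(score_transposed: float) -> float:
    """Small when the critic mistakes a transposed sound for a real one."""
    return binary_cross_entropy(score_transposed, 1)

# Hypothetical critic scores early and late in training.
for label, score in [("early", 0.1), ("late", 0.9)]:
    print(f"{label}: transposer loss = {transposer_loss(score):.3f}")
```

The two losses pull in opposite directions on the same score, which is why each network's progress forces the other to get better.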

This penultimate stage has only just begun at the time of writing. "The hope," says Axel Roebel, "is that the neural network will improve the quality of its transpositions and limit losses until they are undetectable. At present, we are achieving this over two octaves. Ideally, we would like to be able to reach three and a half octaves."
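To put these ranges in numbers: each octave doubles the fundamental frequency, so the spans quoted above correspond to quite different frequency ratios. The short calculation below assumes twelve-tone equal temperament and uses 220 Hz purely as an example starting pitch; neither detail comes from the researchers.

```python
def octaves_to_ratio(octaves: float) -> float:
    """Frequency ratio covered by a span of the given number of octaves."""
    return 2.0 ** octaves

def octaves_to_semitones(octaves: float) -> float:
    """Number of equal-tempered semitones in the given number of octaves."""
    return 12.0 * octaves

def transpose(f0_hz: float, semitones: float) -> float:
    """Fundamental frequency after shifting f0_hz by a number of semitones."""
    return f0_hz * 2.0 ** (semitones / 12.0)

current_span = 2.0   # octaves reached so far, per the article
target_span = 3.5    # octaves hoped for, per the article

for span in (current_span, target_span):
    print(
        f"{span} octaves = {octaves_to_semitones(span):.0f} semitones, "
        f"frequency ratio {octaves_to_ratio(span):.2f}"
    )

example_top = transpose(220.0, octaves_to_semitones(current_span))  # 880.0 Hz
```

Going from two to three and a half octaves therefore means covering a frequency ratio of about 11.3 instead of 4.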

The final step will be to do what we wanted to do from the beginning: to augment the singer by extending her singing to all pitches in a consistent and plausible way. "The principle," explains Frederik Bous, "is actually the same as that of deepfakes for faces and videos." "By proceeding in this way," adds Axel Roebel, "we create a hybrid voice with no break in its timbre across its pitch range."