In a context where personal assistants and human-machine interactions are becoming part of our daily lives, voice has become the preferred modality for interacting with machines. Voice synthesis has made enormous progress in recent years, particularly through the use of deep learning and large multi-speaker databases. However, two principal limitations remain. The first is low expressivity: the agent's behavior is still often monomodal (voice only, as in assistants such as Alexa or Google Home) and remains very monotonous, which greatly reduces the acceptance, duration, and quality of interactions. The second is that the agent's behavior is poorly adapted, or not adapted at all, to the speaker and to the situation, which reduces the listener's understanding of the information and slows their reaction to it.
The MoVE project will develop neural learning algorithms that adapt the speech style of a synthetic voice to a specific interaction situation, focusing, for example, on the attitudes conveyed by the synthesized voice (cordial, smiling, authoritative, etc.). Better adaptation of the voice style will improve understanding of the information communicated by the agent and reduce human reaction time to the information provided (e.g., in an emergency situation).
IRCAM Team: Sound Analysis-Synthesis