The Voice Filter: Towards the Anthropotechnics of our Social Cognition

Nadia Guerouaou is a researcher in affective neuroscience with the Perception and Sound Design team of the STMS laboratory (IRCAM, Sorbonne University, CNRS, French Ministry of Culture), and a psychologist specializing in the treatment of post-traumatic stress disorder (PTSD). She is interested in the impact of new digital technologies on our social cognition, and in particular on our emotional inferences during social interactions. Here, she presents the fruits of her research.

In our day-to-day lives, we engage in social interactions in which we extract information, often unconsciously, about our interlocutors. This information can be more or less direct. It ranges from physical appearance to what we call "hidden states", that is, the mental states of our interlocutor. Through the paralinguistic channel, for example, I can infer an individual's emotional state from their tone of voice (Juslin and Laukka, 2003), discern social attitudes such as warmth or benevolence from their prosody (Ponsot et al., 2018), and pick up metacognitive elements such as doubt about what they are saying (Goupil et al., 2021). These potential inferences even extend to my interlocutor's physiology, in particular their heart rate (Galvez-Pol et al., 2022). So, in the course of a social interaction, we infer, with varying degrees of accuracy, a considerable amount of information about our interlocutor. My thesis work took these emotional inferences as its object and aimed to study, by means of the Bayesian predictive inference model (Friston and Frith, 2015b), how new computer technologies for controlling the emotional tone of voice might influence our social cognition.

According to Bayesian predictive inference, which calculates the degree of confidence to be placed in a hypothetical cause, our inferences about the emotional states of our interlocutors are based on an internal model of the world, built from our beliefs and from past experiences acquired in our environment. Consequently, a particular inflection of voice (or facial expression) is not an emotion in itself, but becomes one through the meaning given to it by our cognition (Barrett, 2012). The brain, far from being a passive information-processing tool, could be constantly creating our perception of the emotions in our interlocutors' voices based on our beliefs, which in a Bayesian model can be expressed as probabilities. For example, the fact that in Western society smiling is associated with a state of joy, which is not the case everywhere in the world (Niedenthal et al., 2018), stems from the internal model assigning a high probability to the association between smiling and positive emotion. This strong association guides our perception of other people's states. Generally speaking, the predictive inference model implies that what we perceive today is deeply rooted in what we experienced yesterday.
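To make the idea of beliefs as probabilities concrete, here is a minimal sketch of Bayes' rule applied to the smiling example. It is only an illustration, not the model used in the research: the function name and every numerical value are hypothetical, chosen to contrast an internal model in which smiling is strongly tied to joy with one in which the link is weak.

```python
def posterior_joy(prior_joy, p_smile_given_joy, p_smile_given_other):
    """Bayes' rule: confidence that the interlocutor is joyful, given a smile.

    All three arguments are probabilities between 0 and 1. They stand for
    the listener's internal model and are hypothetical values.
    """
    # Total probability of observing a smile under the internal model
    evidence = (p_smile_given_joy * prior_joy
                + p_smile_given_other * (1 - prior_joy))
    return p_smile_given_joy * prior_joy / evidence


# Internal model A (assumed): smiling is strongly associated with joy
strong = posterior_joy(prior_joy=0.3, p_smile_given_joy=0.8,
                       p_smile_given_other=0.2)

# Internal model B (assumed): smiling is only weakly associated with joy
weak = posterior_joy(prior_joy=0.3, p_smile_given_joy=0.5,
                     p_smile_given_other=0.4)

print(f"P(joy | smile), strong association: {strong:.2f}")
print(f"P(joy | smile), weak association:   {weak:.2f}")
```

With the same smile as input, the two internal models yield different degrees of confidence in joy, which is the sense in which the perceived emotion depends on prior experience rather than on the expression alone.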

However, today, our experiences are shaped by two major transformations in our society. Firstly, recent years have seen the emergence of technologies that can be used to control facial and vocal expressions, which were once considered natural and which we associate with the emotional states of our interlocutors. Outside the laboratory, these techniques for shaping a live image of one's face, grouped together under the term "filter" (although strictly speaking they are not a filtering technique), are generating a real craze on social networks. Their voice counterparts, which I call "voice filters" in my work by analogy, remain more limited in use for the time being. However, their use is already being explored, thanks in particular to technical advances and the recent popularization of voice notes or "vocals". Seven billion vocals are sent every day on WhatsApp alone, a craze which, according to Catherine Lejealle (sociologist and researcher specializing in digital uses), can be explained by the fact that they are much better at conveying the emotional content of a message.

Secondly, in the digital age, our social interactions are increasingly mediated by technological tools. Videoconferencing interfaces have become an integral part of our personal and professional lives, even in the fields of medicine and psychiatry, where consultations are largely based on oral communication. The extension of the digital sphere could potentially increase the scope of the first phenomenon.

Against this background, my research asks whether these voice filters could upset our internal model of the world and the inferences we make about the emotional state of our interlocutor during a social interaction. This potential of filters, which can be described as "anthropotechnical", depends largely on their widespread use by the population, which in turn depends on how acceptable society finds these new technologies.

We conducted an experimental ethics study to assess the moral acceptability of using computer technologies to parameterize the emotional tone of the voice (Guerouaou et al., 2021). The results of this study showed that the young French population was generally in favor of the idea of using voice filters to artificially modify the emotional tone of the voice, and that there was no social dilemma. So there seems to be no major obstacle to the widespread use of filters. In the light of the predictive inference model, it seems crucial to consider the possible long-term effects of such widespread use. To name just one, it could recalibrate our expectations about how a given emotion is expressed in the voice. As a result, what currently seems "normal" because it is expected, such as hearing a slight tremor in the voice of a slightly nervous interlocutor, could become much less so in an environment that allows us to control this kind of vocal manifestation. This phenomenon, referred to as "norm creep" in the field of bioethics concerning augmentation technologies (Goffi, 2009), appears quite applicable to a technology such as the voice filter, which I propose in my work to observe as a technology of the self.
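The recalibration described above can be sketched as a simple update of expectations from experience. The example below is purely illustrative and assumes, for simplicity, that a listener's expectation of hearing a tremor in a nervous voice is just the frequency with which they have heard one. The function names and all counts are hypothetical and are not taken from the study.

```python
import math


def update_expectation(tremor_count, total_count, observations):
    """Add new encounters with nervous speakers to the listener's experience.

    Each observation is True if a tremor was audible and False otherwise.
    """
    for heard_tremor in observations:
        tremor_count += int(heard_tremor)
        total_count += 1
    return tremor_count, total_count


def surprisal(probability):
    """Surprise, in bits, at an event the listener expects with this probability."""
    return -math.log2(probability)


# Assumed past experience: a tremor was audible in 70 of 100 nervous speakers
tremor, total = 70, 100
expected_before = tremor / total

# Assumed filtered environment: 400 further nervous speakers, of whom
# only 40 have an audible tremor because most use a voice filter
filtered_encounters = [True] * 40 + [False] * 360
tremor, total = update_expectation(tremor, total, filtered_encounters)
expected_after = tremor / total

print(f"Expected tremor rate before filters: {expected_before:.2f}")
print(f"Expected tremor rate after filters:  {expected_after:.2f}")
print(f"Surprise at a tremor, before: {surprisal(expected_before):.2f} bits")
print(f"Surprise at a tremor, after:  {surprisal(expected_after):.2f} bits")
```

Under these assumptions, the same tremor that was once expected becomes a more surprising event once filtered voices dominate the listener's experience, which is one way of picturing norm creep.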

"What's more, if moral values can influence the use of technologies (the subject of our ethics study), it is also accepted that this use would itself have the potential to influence our moral landscape."

This effect is described under the term "soft impacts" of technologies, which refers to the way in which their introduction affects relationships, identities, norms and values in society (Van der Burg, 2009). The voice filter, in addition to shaping what might be called the external image of the self (I use a voice filter to sound more smiling to my interlocutor), would also be, in view of the results of our work, an object for shaping the internal models on which our social cognition is based.

These tools raise important ethical questions that need to be addressed right from the design stage. In this respect, the work carried out at IRCAM on the question of the influence of computerized voice transformation technologies on cognition, in interaction with the researchers who work on the creation of these tools, seems to me to be among the proposals for reflection that would encourage the emergence of a collective wisdom (Andler, 2021) concerning the societal issues raised by the use of these new technologies of the self.

Andler, D. (2021). Technologies émergentes et sagesse collective. Comprendre, faire comprendre, maîtriser. Un vaste programme de plus ? Les cahiers de Tesaco.
Barrett, L. F. (2012). Emotions are real. Emotion, 12(3): 413–429.
Friston, K. J. and Frith, C. D. (2015b). Active inference, communication and hermeneutics. Cortex, 68: 129–143.
Galvez-Pol, A., Antoine, S., Li, C., and Kilner, J. M. (2022). People can identify the likely owner of heartbeats by looking at individuals’ faces. Cortex, 151: 176–187.
Goffi, J.-Y. (2009). Thérapie, augmentation et finalité de la médecine.
Guerouaou, N., Vaiva, G., and Aucouturier, J.-J. (2021). The shallow of your smile: the ethics of expressive vocal deep-fakes. Philosophical Transactions of the Royal Society B: Biological Sciences, 377(1841): 20210083.
Juslin, P. N. and Laukka, P. (2003). Communication of emotions in vocal expression and music performance: Different channels, same code? Psychological Bulletin, 129(5): 770.
Niedenthal, P. M., Rychlowska, M., Wood, A., and Zhao, F. (2018). Heterogeneity of long-history migration predicts smiling, laughter and positive emotion across the globe and within the United States. PLoS ONE, 13(8): e0197651.
Van der Burg, S. (2009). Taking the “soft impacts” of technology into account: broadening the discourse in research practice. Social Epistemology, 23(3-4): 301–316.