In the context of the HOULE project, IRCAM proposes innovative unsupervised learning techniques to address the problem of structuring audio scenes (Computational Auditory Scene Analysis, CASA): identifying and describing the sound objects that make up a scene. CASA is a very active area of audio processing with numerous applications. Current methods are blocked on the following points:
- Without hypotheses about the nature of the objects being processed, they cannot be modeled
- The objects combined in a scene cannot be observed in isolation from one another
- The structure of the objects is governed by numerous relationships that are difficult to prioritize
The characteristics we exploit in our approach are the hierarchical organization of audio scenes (atoms grouped into objects that are instances of classes such as "Piano A4", itself an instance of "Piano note") and the redundancy present at every level of this hierarchy. This redundancy enables us to identify recurring motifs on which a rich and robust representation can be built.
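The idea of exploiting redundancy to find recurring motifs can be illustrated with a minimal sketch. The function below is hypothetical (not part of the HOULE system): it treats a scene as a symbolic sequence of atom labels and counts contiguous n-grams that recur, whereas a real CASA front end would operate on quantized time-frequency atoms.

```python
from collections import Counter

def recurring_motifs(atoms, length, min_count=2):
    """Return the contiguous n-grams of `atoms` that occur at least `min_count` times.

    `atoms` is a symbolic sequence standing in for low-level audio atoms.
    """
    grams = Counter(
        tuple(atoms[i:i + length]) for i in range(len(atoms) - length + 1)
    )
    return {motif: n for motif, n in grams.items() if n >= min_count}

# A toy "scene": a three-note pattern repeated amid other atoms.
scene = ["A4", "C5", "E5", "G2", "A4", "C5", "E5", "B2"]
print(recurring_motifs(scene, 3))  # → {('A4', 'C5', 'E5'): 2}
```

The recurring motif `("A4", "C5", "E5")` is the kind of regularity that, at a higher level of the hierarchy, would be promoted to an object candidate.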
This new way of representing audio scenes has led to the design of an unsupervised learning algorithm built expressly to process such data. The system has two components: the MLG (multi-level grouping), which structures the data, and the Supervisor, a reflexive adaptation module that embodies the learning aspect by optimizing the MLG's operation on the fly with reference to stored memories of past executions.
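The two-component architecture can be sketched as follows. Only the names MLG and Supervisor come from the text; the grouping criterion (a gap threshold over atom onset times), the scoring, and the memory format are illustrative assumptions, not the project's actual algorithm.

```python
class MLG:
    """Multi-level grouping (toy sketch): clusters atoms into objects."""

    def __init__(self, threshold):
        # Grouping parameter that the Supervisor tunes between runs (assumed).
        self.threshold = threshold

    def group(self, onsets):
        # Start a new group whenever the gap between onsets exceeds the threshold.
        groups, current = [], [onsets[0]]
        for prev, cur in zip(onsets, onsets[1:]):
            if cur - prev <= self.threshold:
                current.append(cur)
            else:
                groups.append(current)
                current = [cur]
        groups.append(current)
        return groups


class Supervisor:
    """Reflexive adaptation (toy sketch): tunes the MLG from past-run memories."""

    def __init__(self, mlg):
        self.mlg = mlg
        self.memory = []  # (threshold, score) pairs from earlier executions

    def run(self, onsets, target_groups):
        groups = self.mlg.group(onsets)
        # Toy objective: prefer thresholds yielding the expected object count.
        score = -abs(len(groups) - target_groups)
        self.memory.append((self.mlg.threshold, score))
        # Reflexive step: reuse the best-scoring threshold seen so far.
        self.mlg.threshold = max(self.memory, key=lambda m: m[1])[0]
        return groups


sup = Supervisor(MLG(threshold=1.0))
groups = sup.run([0.0, 0.2, 0.4, 3.0, 3.1], target_groups=2)
print(groups)  # → [[0.0, 0.2, 0.4], [3.0, 3.1]]
```

The point of the sketch is the division of labor: the MLG performs the structuring at each run, while the Supervisor closes the loop by scoring outcomes against stored memories and adjusting the MLG's parameters for subsequent executions.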
The originality of our proposal lies in its distance from traditional CASA approaches, starting with the paradigm used to represent scenes and objects. The innovation lies primarily in the unsupervised learning methods we will develop, whose applications extend well beyond the CASA framework.