Recent challenges have shown that it is extremely difficult to develop universal detectors of deep-fakes, such as those used to impersonate an individual's identity. When a detector is exposed to videos generated by a new algorithm, i.e. one unknown during the learning phase, its performance remains extremely limited.
On the video side, current generation algorithms process frames one by one, without taking into account the evolution of facial dynamics over time. On the audio side, the voice is generated independently of the video; in particular, the synchronisation between the voice and the lip movements is not taken into account. This is a major weakness of deep-fake generation algorithms. The DeTOX project aims at implementing and training custom deep-fake detection algorithms for specific individuals for whom many real and fake audio-video sequences are available and/or can be fabricated. Building on basic audio and video building blocks taken from the state of the art, the project will focus on the temporal evolution of audio-visual signals and on their coherence, both for generation and for detection. In this way, we aim to demonstrate that, by using audio and video simultaneously and by focusing on a specific person during training and detection, it is possible to design efficient detectors even against as yet unseen generators.
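The audio-visual coherence cue mentioned above can be illustrated with a minimal sketch. This is not the project's actual method: the `sync_score` function, the synthetic mouth-opening and audio-envelope signals, and the use of a plain Pearson correlation are all illustrative assumptions, standing in for the learned temporal models the project would develop.

```python
import numpy as np

def sync_score(mouth_opening, audio_envelope):
    """Pearson correlation between a per-frame mouth-opening signal and
    the audio loudness envelope sampled at the same frame rate.
    A genuine talking-head clip tends to score high; a fake whose voice
    was generated independently of the video tends to score lower."""
    m = (mouth_opening - mouth_opening.mean()) / (mouth_opening.std() + 1e-8)
    a = (audio_envelope - audio_envelope.mean()) / (audio_envelope.std() + 1e-8)
    return float(np.mean(m * a))

# Synthetic demo: a shared speech rhythm drives both signals in the
# "real" clip, while the "fake" clip pairs the video with unrelated audio.
rng = np.random.default_rng(0)
t = np.linspace(0, 4, 100)                    # 100 frames over 4 seconds
speech = np.abs(np.sin(2 * np.pi * 2.0 * t))  # shared articulation rhythm
mouth = speech + 0.1 * rng.standard_normal(t.size)
real_audio = speech + 0.1 * rng.standard_normal(t.size)
fake_audio = np.abs(np.sin(2 * np.pi * 3.1 * t + 1.0))  # independent voice

print(sync_score(mouth, real_audio))  # high: signals share one rhythm
print(sync_score(mouth, fake_audio))  # lower: rhythms are unrelated
```

A real system would extract the mouth-opening signal from facial landmarks and the envelope from the audio track, and would replace the single global correlation with a model of how the two streams co-evolve over time.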
Such tools will make it possible to scan the web and detect possible deep-fake videos of important French public figures (the President of the Republic, journalists, the chief of the army, ...) as soon as they are published.