By decomposing an audio recording, researchers have managed not only to animate a face but, above all, to convey the emotions carried in that recording. It is an advance that could improve graphics in video games but could also, unfortunately, produce ever more realistic "deepfake" videos.
A team of researchers at Microsoft has just published a paper describing a new system for animating faces from an audio recording alone. The method takes advantage of advances in deep learning to create a video of a talking face that reflects the emotions detected in the voice.
Animating a face from an audio recording is not entirely new, but existing methods assume a clean sound sample, free of background noise and spoken in a neutral tone.
The new system uses a variational autoencoder (VAE) that learns to disentangle the different components of the audio recording: phonetic content, emotional tone, and background noise. This makes it far more robust and allows it to create animations from more natural recordings.
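To make the idea concrete, here is a minimal, hypothetical sketch (not the authors' actual architecture) of a VAE whose latent space is split into content, emotion, and noise factors. All layer sizes, dimensions, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DisentanglingAudioVAE(nn.Module):
    """Illustrative VAE mapping audio features to three separate latent
    factors (phonetic content, emotion, noise). Dimensions are arbitrary."""

    def __init__(self, n_mels=80, content_dim=64, emotion_dim=16, noise_dim=8):
        super().__init__()
        latent_dim = content_dim + emotion_dim + noise_dim
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        # One head for the mean, one for the log-variance of the latent vector.
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, n_mels),
        )
        self.dims = (content_dim, emotion_dim, noise_dim)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = self.reparameterize(mu, logvar)
        # Split the latent vector into its three hypothetical factors.
        content, emotion, noise = torch.split(z, self.dims, dim=-1)
        recon = self.decoder(z)
        return recon, mu, logvar, (content, emotion, noise)

# Example usage on a batch of 10 mel-spectrogram frames.
x = torch.randn(10, 80)
model = DisentanglingAudioVAE()
recon, mu, logvar, (content, emotion, noise) = model(x)
# Standard VAE objective: reconstruction plus KL divergence to a unit Gaussian.
recon_loss = nn.functional.mse_loss(recon, x)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon_loss + kl
```

In practice, separating the factors also requires dedicated training signals (for example, encouraging the emotion factor to predict emotion labels); the sketch above only shows how the latent space can be partitioned.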
Many applications in dubbing and 3D animation
The audio track is broken down into representations that can then be used with different animation methods. Here, the Microsoft researchers used generative adversarial networks (GANs), two neural networks trained in competition, to create their videos. This allows them not only to animate a face that "speaks" but also to imbue it with the emotions carried in the recording, as sketched below.
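Below is a minimal, hypothetical sketch of such an adversarial setup, assuming a generator conditioned on the content and emotion latents from the previous example and a discriminator that judges individual frames. The real system is considerably more sophisticated; every module and dimension here is an illustrative assumption.

```python
import torch
import torch.nn as nn

class FrameGenerator(nn.Module):
    """Generates a face frame conditioned on the audio latents."""
    def __init__(self, content_dim=64, emotion_dim=16, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(content_dim + emotion_dim, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * img_size * img_size), nn.Tanh(),
        )

    def forward(self, content, emotion):
        z = torch.cat([content, emotion], dim=-1)
        return self.net(z).view(-1, 3, self.img_size, self.img_size)

class FrameDiscriminator(nn.Module):
    """Scores whether a frame looks like a real talking-face image."""
    def __init__(self, img_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * img_size * img_size, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1),
        )

    def forward(self, img):
        return self.net(img)

# One adversarial training step on a dummy batch.
G, D = FrameGenerator(), FrameDiscriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

content, emotion = torch.randn(8, 64), torch.randn(8, 16)
real_frames = torch.rand(8, 3, 64, 64) * 2 - 1  # placeholder "real" images

# Discriminator step: real frames should score 1, generated frames 0.
fake_frames = G(content, emotion).detach()
d_loss = bce(D(real_frames), torch.ones(8, 1)) + bce(D(fake_frames), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to fool the discriminator into scoring fakes as real.
g_loss = bce(D(G(content, emotion)), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```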
Like any technology, it could be abused to create deepfakes. The authors, however, focus on more useful applications, such as dubbing a video into another language, generating 3D avatars in real time, or improving character animation in video games.