Neural Emotion Director: Speech-preserving semantic control of facial expressions in “in-the-wild” videos

CVPR 2022 (Oral, Best Paper Finalist)

¹ School of Electrical & Computer Engineering, National Technical University of Athens, Greece
² Institute of Computer Science (ICS), Foundation for Research & Technology - Hellas (FORTH), Greece
³ College of Engineering, Mathematics and Physical Sciences, University of Exeter, UK

NED enables the user to manipulate the emotion of a talking face in a video without distorting the speech-related lip motion.

Abstract

In this paper, we introduce a novel deep learning method for photo-realistic manipulation of the emotional state of actors in “in-the-wild” videos. The proposed method is based on a parametric 3D face representation of the actor in the input scene that offers a reliable disentanglement of the facial identity from the head pose and facial expressions. It then uses a novel deep domain translation framework that alters the facial expressions in a consistent and plausible manner, taking into account their dynamics. Finally, the altered facial expressions are used to photo-realistically manipulate the facial region in the input scene based on a specially designed neural face renderer. To the best of our knowledge, our method is the first capable of controlling the actor’s facial expressions even when only the semantic labels of the target emotions are provided as input, while at the same time preserving the speech-related lip movements. We conduct extensive qualitative and quantitative evaluations and comparisons, which demonstrate the effectiveness of our approach and the particularly promising results that we obtain. Our method opens a plethora of new possibilities for useful applications of neural rendering technologies, ranging from movie post-production and video games to photo-realistic affective avatars.
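To make the disentanglement idea concrete, the following minimal sketch shows a generic 3DMM-style linear face model in which identity, expression and head pose are separate parameters; the basis matrices, dimensions and function name are illustrative assumptions, not the exact face model used by NED.

# Illustrative sketch of a generic 3DMM-style parametric face model (NumPy);
# sizes and names are assumptions for illustration only.
import numpy as np

def reconstruct_face(mean_shape, id_basis, exp_basis, alpha, beta, R, t):
    # mean_shape: (3N,) mean face geometry
    # id_basis:   (3N, K_id)  identity basis (fixed per actor)
    # exp_basis:  (3N, K_exp) expression basis (the part an expression edit changes)
    # alpha, beta: identity / expression coefficients
    # R, t: 3x3 head rotation and 3-vector translation (head pose, left untouched)
    shape = mean_shape + id_basis @ alpha + exp_basis @ beta  # disentangled geometry
    return shape.reshape(-1, 3) @ R.T + t                     # apply head pose

Because identity, expression and pose enter through separate terms, only the expression coefficients need to be edited while everything else is carried over from the input video.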

Video

For a detailed presentation, check our full demo video.

Overview

Neural Emotion Director (NED) can manipulate facial expressions in input videos while preserving speech, conditioned on either the semantic emotional label or on an external reference style as extracted from a reference video.
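For a concrete picture of the two conditioning modes, the PyTorch-style sketch below shows how a style vector could be produced either from a target emotion label (via a mapping network) or from a reference clip (via a style encoder); module names, layer sizes and feature dimensions are assumptions, not the released NED architecture.

# Minimal sketch of label- vs. reference-guided style computation (assumed sizes).
import torch
import torch.nn as nn

NUM_EMOTIONS, STYLE_DIM, LATENT_DIM, FEAT_DIM = 7, 64, 16, 128  # placeholder dimensions

class MappingNetwork(nn.Module):
    # Label-guided mode: random latent code + target emotion label -> style vector.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, NUM_EMOTIONS * STYLE_DIM))
    def forward(self, z, label):
        s = self.net(z).view(-1, NUM_EMOTIONS, STYLE_DIM)
        return s[torch.arange(z.size(0)), label]  # pick the branch of the target emotion

class StyleEncoder(nn.Module):
    # Reference-guided mode: expression features of a reference clip -> style vector.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, STYLE_DIM))
    def forward(self, ref_features):
        return self.net(ref_features)

Either path yields a style vector that conditions the expression translation; as described in the abstract, the manipulation itself is applied over sequences of expression parameters so that their dynamics are taken into account.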


First, we perform 3D facial recovery and alignment on the input frames to obtain the expression parameters of the face. Then, these parameters are translated by our 3D-based Emotion Manipulator, whose style vector is computed either from a semantic label (i.e., the target emotion) or from a driving reference video. Finally, the manipulated 3D facial shape is used to compute the Normalized Mean Face Coordinate (NMFC) and eye images, which are concatenated and fed into a neural renderer (along with previously generated frames) in order to render the manipulated, photo-realistic frames.
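Putting the three stages together, the pseudocode below outlines the per-clip flow. Every helper passed in is a hypothetical placeholder for the corresponding stage (not the repo's API), and in practice the Emotion Manipulator operates on temporal windows of expression parameters rather than frame by frame.

# Pseudocode sketch of the pipeline above; the stage callables are hypothetical.
def manipulate_clip(frames, recover_3d, manipulator, make_nmfc_eyes, renderer, style):
    # 1) 3D facial recovery and alignment on every frame
    params = [recover_3d(f) for f in frames]           # (identity, pose, expression) per frame
    expressions = [p[2] for p in params]
    # 2) translate the expression sequence, conditioned on the style vector
    #    (computed from a semantic label or a driving reference video)
    new_expressions = manipulator(expressions, style)
    # 3) compute NMFC + eye conditioning images and feed the neural renderer,
    #    which also conditions on previously rendered frames
    rendered = []
    for (identity, pose, _), expr, frame in zip(params, new_expressions, frames):
        nmfc, eyes = make_nmfc_eyes(identity, pose, expr)
        rendered.append(renderer(frame, nmfc, eyes, rendered))
    return rendered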

Additional Results

Here, we provide additional results on YouTube videos.

1) Label-guided manipulation

2) Reference-guided manipulation

BibTeX

@inproceedings{paraperas2022ned,
  title={Neural Emotion Director: Speech-preserving semantic control of facial expressions in "in-the-wild" videos},
  author={Paraperas Papantoniou, Foivos and Filntisis, Panagiotis P. and Maragos, Petros and Roussos, Anastasios},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}