├── notes ├── moda.md ├── readme.md ├── example.md ├── StyleHEAT note.md └── sadtalker.md ├── advances ├── README.md ├── tutorials.md └── iccv23_papers.md ├── benchmarks ├── utils │ ├── __init__.py │ ├── __pycache__ │ │ ├── __init__.cpython-39.pyc │ │ └── video_processing.cpython-39.pyc │ └── video_processing.py ├── assets │ └── file_structure.png ├── test.ipynb └── readme.md ├── assets ├── StyleHEAT.png └── sadtalker.png ├── SUMMARY.md ├── memory.md └── README.md /notes/moda.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /advances/README.md: -------------------------------------------------------------------------------- 1 | # advances 2 | 3 | -------------------------------------------------------------------------------- /benchmarks/utils/__init__.py: -------------------------------------------------------------------------------- 1 | from .video_processing import count_sentences -------------------------------------------------------------------------------- /assets/StyleHEAT.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jason-cs18/awesome-avatar/HEAD/assets/StyleHEAT.png -------------------------------------------------------------------------------- /assets/sadtalker.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jason-cs18/awesome-avatar/HEAD/assets/sadtalker.png -------------------------------------------------------------------------------- /benchmarks/assets/file_structure.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jason-cs18/awesome-avatar/HEAD/benchmarks/assets/file_structure.png -------------------------------------------------------------------------------- /benchmarks/utils/__pycache__/__init__.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jason-cs18/awesome-avatar/HEAD/benchmarks/utils/__pycache__/__init__.cpython-39.pyc -------------------------------------------------------------------------------- /benchmarks/utils/__pycache__/video_processing.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jason-cs18/awesome-avatar/HEAD/benchmarks/utils/__pycache__/video_processing.cpython-39.pyc -------------------------------------------------------------------------------- /benchmarks/utils/video_processing.py: -------------------------------------------------------------------------------- 1 | def count_sentences(text_url): 2 | """ 3 | Counts the number of sentences in a given text. 4 | :param text: The text to count the sentences in. 5 | :return: The number of sentences in the text. 
6 | """ 7 | with open(text_url, 'r') as f: 8 | text = f.readlines() 9 | 10 | return len(text[0].split(".")) -------------------------------------------------------------------------------- /SUMMARY.md: -------------------------------------------------------------------------------- 1 | # Table of contents 2 | 3 | * [awesome-avatar](README.md) 4 | * [advances](advances/README.md) 5 | * [ICCV'23, Oct 4-6, 2023](advances/iccv23\_papers.md) 6 | * [Paper note](notes/readme.md) 7 | * [Title](notes/example.md) 8 | * [moda](notes/moda.md) 9 | * [SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking](notes/sadtalker.md) 10 | -------------------------------------------------------------------------------- /notes/readme.md: -------------------------------------------------------------------------------- 1 | # Paper note 2 | All notes follow the style of [DeeplearningAI](https://www.deeplearning.ai/the-batch/tag/research/). 3 | 4 | - [Note Template](https://github.com/Jason-cs18/awesome-avatar/blob/main/notes/example.md) 5 | - [SadTalker CVPR'23](https://github.com/Jason-cs18/awesome-avatar/blob/main/notes/sadtalker.md) 6 | - [MODA ICCV'23](https://github.com/Jason-cs18/awesome-avatar/blob/main/notes/moda.md) 7 | -------------------------------------------------------------------------------- /memory.md: -------------------------------------------------------------------------------- 1 | # Memory-Augmented Talking Face Synthesis 2 | - [arXiv 2020.10] [Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose](https://arxiv.org/pdf/2002.10137.pdf) | Tsinghua University 3 | - [AAAI'22 Oral] [SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory](https://arxiv.org/abs/2211.00924) | KAIST 4 | - [arXiv 2022.12] [Memories are One-to-Many Mapping Alleviators in Talking Face Generation](https://arxiv.org/pdf/2212.05005.pdf) | Shanghai Jiao Tong University 5 | - [ICCV'23] [EMMN: Emotional Motion Memory Network for Audio-driven Emotional Talking Face Generation](https://openaccess.thecvf.com/content/ICCV2023/papers/Tan_EMMN_Emotional_Motion_Memory_Network_for_Audio-driven_Emotional_Talking_Face_ICCV_2023_paper.pdf) | Shanghai Jiao Tong University 6 | - [arXiv 2023.05] [GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation](https://arxiv.org/abs/2305.00787) | Zhejiang University 7 | -------------------------------------------------------------------------------- /benchmarks/test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 7, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "49\n" 13 | ] 14 | } 15 | ], 16 | "source": [ 17 | "import utils\n", 18 | "\n", 19 | "sample_transcript_file = \"/home/jason/Downloads/transcript (3).txt\"\n", 20 | "\n", 21 | "print(utils.count_sentences(sample_transcript_file))" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [] 30 | } 31 | ], 32 | "metadata": { 33 | "kernelspec": { 34 | "display_name": "pytorch", 35 | "language": "python", 36 | "name": "python3" 37 | }, 38 | "language_info": { 39 | "codemirror_mode": { 40 | "name": "ipython", 41 | "version": 3 42 | }, 43 | "file_extension": ".py", 44 | "mimetype": "text/x-python", 45 | "name": "python", 46 | "nbconvert_exporter": "python", 47 
| "pygments_lexer": "ipython3", 48 | "version": "3.9.17" 49 | } 50 | }, 51 | "nbformat": 4, 52 | "nbformat_minor": 2 53 | } 54 | -------------------------------------------------------------------------------- /notes/example.md: -------------------------------------------------------------------------------- 1 | ## Title 2 | Use 1-2 sentences to highlight the key contributions of the paper. 3 | 4 | **what's new:** The authors (xxx and xxx from xxx) proposed xxx to address/achieve xxx. 5 | 6 | **key insights:** Previous works leverages xxx to achieve xxx but they are limited by xxx. To overcome xxx, authors designed xxx. 7 | 8 | **How it works:** The authors designed three-stage pipeline named xxx to generate xxx. In the first, xxx-1. In the second, xxx-2. In the end, xxx-3. 9 | - xxx-1 is trained with xxx. Compared with existing works, xxx is better than xxx. It is because that xxx. 10 | - xxx-2 is trained with xxx. Compared with existing works, xxx is better than xxx. It is because that xxx. 11 | - xxx-3 is trained with xxx. Compared with existing works, xxx is better than xxx. It is because that xxx. 12 | 13 | **Results:** The authors evaluated xxx on xxx. Compared with xxx, xxx is better on xxx. But it is worse than xxx on xxx. It is because that xxx. 14 | 15 | **Why it matters:** This work reveals that xxx but xxx. Such insights deepen our understanding of xxx and can help practitioners explain their outputs. 16 | 17 | **We're thinking:** xxx-1 and xxx-2 may be useful for other pipelines because xxx. 18 | 19 | -------------------------------------------------------------------------------- /notes/StyleHEAT note.md: -------------------------------------------------------------------------------- 1 | # StyleHEAT 2 | 3 | ## StyleHEAT 4 | 5 | Propose a unified framework based on a pre-trained Style-GAN for one-shot talking face generation,which use audio or video driving a source image. 6 | 7 | **what’s new:** 8 | 9 | Fei Yin from Tsinghua University and researchers from Tencent AI Lab have released a new generation pipeline named StyleHEAT to synthesis diverse and stylized talking-face videos 10 | 11 | **key insights:** 12 | 13 | Previous work was unable to generate high-resolution speaking videos due to dataset, efficiency, and other limitations. To overcome the low quality of video resolution,the author verified the spatial characteristics of sytleGAN and added it to the generation framework to achieve high-resolution face speaking video generation, and can edit it by attributes. 14 | 15 | **How it works:** 16 | 17 | The authors designed Video-Driven , Autdio-driven Motion Generator and Feature Calibration to achieve the unified framework. 18 | 19 | - First it obtain the style codes and feature maps of the source image by the encoder of GAN inversion. 20 | - Second the video or audio along with the source image are used to predict motion fields by the corresponding **motion generator**. The motion generator output the desired flow fields for feature warping 21 | - Driving-Video Motion Generator use 3DMM parameters as the motion representation,and the network is based on U-Net 22 | - Driving-Audio Motion Generator use the Mel-Spectrogram,use an MLP to squeeze the temporal dimension. The network is the via AdaIN 23 | - Then,the selected feature map is warped by the motion fields, followed by the **calibration network** for rectifying feature distortions. 24 | - The refined feature map is then fed into the StyleGAN for the final face generation. 
25 | 26 | 27 | 28 | **Results:** 29 | 30 | The authors train the two motion generators on the VoxCeleb dataset, while jointly training the whole framework on the HDTF dataset. The model outputs a 1024×1024 image generated by the pre-trained StyleGAN. The authors compared the generated images with other works, covering both same-identity and cross-identity reenactment. Both qualitatively and quantitatively, this method generates images that are more natural and of higher resolution. In terms of lip sync, it preserves more detail at higher resolution than Wav2Lip. 31 | 32 | **Why it matters:** 33 | 34 | This work proposes a unified framework based on a pre-trained StyleGAN for one-shot, high-quality talking-face generation. Such insights provide a new method and direction for improving generation resolution. 35 | 36 | **We're thinking:** 37 | 38 | We could realize the final video-generation step through GAN inversion and StyleGAN inference, replacing the FaceRender stage in SadTalker. 39 | -------------------------------------------------------------------------------- /advances/tutorials.md: -------------------------------------------------------------------------------- 1 | # Courses, Talks and Tutorials 2 | ## Courses 3 | ### Learning-Based Image Synthesis 4 | - Basic info: CMU 16-726, Spring 2023; 5 | - Instructor: [Jun-Yan Zhu](https://www.cs.cmu.edu/~junyanz/); 6 | - Website: https://learning-image-synthesis.github.io/sp23/ 7 | 8 | ## Tutorials 9 | ### Video Synthesis: Early Days and New Developments 10 | - Conference: ECCV'22 (Oct 24, 2022) 11 | - Organizers: [Sergey Tulyakov @ Snap Research](http://www.stulyakov.com/), [Jian Ren @ Snap Research](https://alanspike.github.io/), [Stéphane Lathuilière @ Telecom Paris](https://stelat.eu/), and [Aliaksandr Siarohin @ Snap Research](https://aliaksandrsiarohin.github.io/aliaksandr-siarohin-website/); 12 | - Website: https://snap-research.github.io/video-synthesis-tutorial/; 13 | 14 | ### Efficient Neural Networks: From Algorithm Design to Practical Mobile Deployments 15 | - Conference: CVPR'23 (June 18, 2023); 16 | - Organizers: [Jian Ren @ Snap Research](https://alanspike.github.io/), [Sergey Tulyakov @ Snap Research](http://www.stulyakov.com/), and [Eric Hu @ Snap](https://www.linkedin.com/in/erichuju/); 17 | - Website: https://snap-research.github.io/efficient-nn-tutorial/; 18 | 19 | ### Full-Stack, GPU-based Acceleration of Deep Learning 20 | - Conference: CVPR'23 (June 18, 2023); 21 | - Organizers: [Maying Shen @ NVIDIA](https://mayings.github.io/), [Jason Clemons @ NVIDIA Research](https://scholar.google.com/citations?user=J_1GGJsAAAAJ&hl=zh-CN), [Hongxu (Danny) Yin @ NVIDIA Research](https://hongxu-yin.github.io/), and [Pavlo Molchanov @ NVIDIA Research](https://www.pmolchanov.com/); 22 | - Website: https://nvlabs.github.io/EfficientDL/; 23 | 24 | ### Denoising Diffusion Models: A Generative Learning Big Bang 25 | - Conference: CVPR'23 (June 18, 2023); 26 | - Organizers: [Jiaming Song @ NVIDIA Research](https://tsong.me/), [Chenlin Meng @ Stanford](https://cs.stanford.edu/~chenlin/), and [Arash Vahdat @ NVIDIA Research](http://latentspace.cc/); 27 | - Website: https://cvpr2023-tutorial-diffusion-models.github.io/; 28 | 29 | ### Prompting in Vision 30 | - Conference: CVPR'23 (June 18, 2023); 31 | - Organizers: [Kaiyang Zhou @ NTU](https://kaiyangzhou.github.io/), [Ziwei Liu @ NTU](https://liuziwei7.github.io/), [Phillip Isola @ MIT](http://web.mit.edu/phillipi/), [Hyojin Bahng @ MIT](), [Ludwig Schmidt @ 
UW](https://people.csail.mit.edu/ludwigs/), [Sarah Pratt @ UW](https://sarahpratt.github.io/), and [Denny Zhou @ Google Research](https://dennyzhou.github.io/) 32 | - Website: https://prompting-in-vision.github.io/; 33 | 34 | ### All Things ViTs: Understanding and Interpreting Attention in Vision 35 | - Conference: CVPR'23 (June 18, 2023); 36 | - Organizers: [Hila Chefer @ Tel Aviv University](https://hila-chefer.github.io/) and [Sayak Paul @ Google Research & Hugging Face](https://sayak.dev/); 37 | - Website: https://all-things-vits.github.io/atv/; 38 | 39 | ### Vision Transformer: More is different 40 | - Conference: CVPR'23 (June 18, 2023); 41 | - Organizers: [Dacheng Tao @ University of Sydney](https://scholar.google.com/citations?user=RwlJNLcAAAAJ&hl=en), [Qiming Zhang @ University of Sydney](https://scholar.google.com/citations?user=f8rAZ7MAAAAJ&hl=zh-CN), [Yufei Xu](https://scholar.google.com/citations?user=hlYWxX8AAAAJ&hl=zh-CN), and [Jing Zhang @ Renmin University of China](https://xiaojingzi.github.io/); 42 | - Website: https://cvpr2023.thecvf.com/virtual/2023/tutorial/18572; 43 | 44 | ## Talks 45 | 1. [Interpreting Deep Generative Models 46 | for Interactive AI Content Creation, Bolei Zhou, CUHK.](https://www.youtube.com/watch?v=PtRU2B6Iml4) -------------------------------------------------------------------------------- /notes/sadtalker.md: -------------------------------------------------------------------------------- 1 | ## SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation 2 | A VAE-based PoseNet is integrated into a new talking-face generation pipeline, SadTalker, to generate diverse head motions in talking-face videos. 3 | 4 | 5 | 6 | **What's new:** Wenxuan Zhang from Xi'an Jiaotong University and researchers from Tencent AI Lab have released a new generation pipeline named SadTalker to synthesize diverse and stylized talking-face videos. 7 | 8 | **Key insights:** Facial landmarks and keypoints were used as the intermediate facial representation in previous works, but they are difficult to disentangle into expressions and head movements. To address this issue, SadTalker leverages an explicit 3D face model to decouple the representations of expression and head motion. Accordingly, the authors designed ExpNet and PoseVAE to learn audio-to-expression and audio-to-pose mappings, respectively. 9 | 10 | 11 | 12 | **How it works:** As shown in Figure 2, the authors designed a three-stage inference pipeline to synthesize stylized talking-face videos. 13 | - They first leveraged a recent single-image deep 3D reconstruction method to extract the 3D face model from the target image, which consists of identity coefficients, expression coefficients, and head pose (rotation and translation). 14 | - Secondly, they estimated the pose and expression coefficients using PoseVAE and ExpNet, respectively, obtaining a motion flow in the 3D facial space. During training, the authors used a reconstruction loss and a distillation loss to encourage ExpNet to learn an accurate mapping for the entire facial motion and for the lip motion, respectively. Similarly, a reconstruction loss and a KL-divergence loss were used to encourage PoseVAE to learn accurate and diverse head motions. 15 | - In the third stage, they leveraged a modified face vid2vid model to render a real face from the estimated 3D facial motion. In training, they first trained face vid2vid in a self-supervised fashion. 
Then they fine-tuned the customized mappingNet in a reconstruction manner. 16 | 17 | 18 | *Note: Training details are introduced in [the supplementary material](https://openaccess.thecvf.com/content/CVPR2023/supplemental/Zhang_SadTalker_Learning_Realistic_CVPR_2023_supplemental.pdf).* 19 | 20 | **Results:** The authors trained SadTalker on VoxCeleb1 with 8 NVIDIA A100 GPUs and tested it on the HDTF dataset. Compared with other competitors, it generated better head motions and more realistic faces, but its lip-sync is worse than that of other methods. 21 | 22 | 23 | 24 | **Why it matters:** This study demonstrated that an explicit 3D face model is a more accurate intermediate representation of facial characteristics than those used in previous methods. Additionally, the use of VAE models for estimating head pose from audio was shown to improve the naturalness of the synthesized videos. These findings have the potential to enhance the quality of talking-face pipelines for machine learning researchers. 25 | 26 | 27 | 28 | **We're thinking:** Instead of using Wav2Lip as a teacher model in ExpNet training, can we use Wav2Lip as ExpNet directly? The 3D-aware face renderer should have similar capabilities to face vid2vid, allowing the renderer to be fine-tuned with 10~30s video clips of a target person to improve lip-sync results. 29 | 30 | *Note: Few-shot learning is widely studied in deep generative models. For details, refer to [fs-vid2vid, NeurIPS'19, NVIDIA](https://nvlabs.github.io/few-shot-vid2vid/) and [face-few-shot, ICCV'19, Samsung AI](https://openaccess.thecvf.com/content_ICCV_2019/papers/Zakharov_Few-Shot_Adversarial_Learning_of_Realistic_Neural_Talking_Head_Models_ICCV_2019_paper.pdf).* -------------------------------------------------------------------------------- /advances/iccv23_papers.md: -------------------------------------------------------------------------------- 1 | # ICCV'23, Oct 4-6, 2023 2 | On this page, we provide a short review of recent advances (21 papers) on digital humans, including 3D face reconstruction, talking-face synthesis, talking-body synthesis, and talking-face video editing. 3 | 4 | ## Workshops and Tutorials 5 | 1. To NeRF or not to NeRF: 6 | A View Synthesis Challenge for Human Heads https://sites.google.com/view/vschh/home 7 | 2. The 11th IEEE International Workshop on Analysis and Modeling of Faces and Gestures https://web.northeastern.edu/smilelab/amfg2023/ 8 | ## 3D Face Reconstruction (3) 9 | 10 | 11 | 12 | 13 | 14 | - [Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation](https://openaccess.thecvf.com/content/ICCV2023/papers/He_Speech4Mesh_Speech-Assisted_Monocular_3D_Facial_Reconstruction_for_Speech-Driven_3D_Facial_ICCV_2023_paper.pdf), University of Science and Technology of China, [FLAME-Universe (3DMM alternative)](https://github.com/TimoBolkart/FLAME-Universe). 15 | - [HiFace: High-Fidelity 3D Face Reconstruction by 16 | Learning Static and Dynamic Details](https://openaccess.thecvf.com/content/ICCV2023/papers/Chai_HiFace_High-Fidelity_3D_Face_Reconstruction_by_Learning_Static_and_Dynamic_ICCV_2023_paper.pdf), National University of Singapore, [ProjectPage](https://project-hiface.github.io/). 17 | - [ASM: Adaptive Skinning Model for High-Quality 3D Face Modeling](https://openaccess.thecvf.com/content/ICCV2023/papers/Yang_ASM_Adaptive_Skinning_Model_for_High-Quality_3D_Face_Modeling_ICCV_2023_paper.pdf), Tencent AI Lab. 
18 | ## Talking-Face Synthesis 19 | 20 | ### Metrics and Benchchmarks (1) 21 | - [On the Audio-visual Synchronization for Lip-to-Speech Synthesis](https://openaccess.thecvf.com/content/ICCV2023/papers/Niu_On_the_Audio-visual_Synchronization_for_Lip-to-Speech_Synthesis_ICCV_2023_paper.pdf), The Hong Kong University of Science and Technology. 22 | 23 | ### 2D Talking-Face Synthesis (11) 24 | *keywords: emotion, diffusion priors, memory network, StyleGAN2*; 25 | 26 |  27 | 28 | - [Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation](https://openaccess.thecvf.com/content/ICCV2023/html/Song_Emotional_Listener_Portrait_Neural_Listener_Head_Generation_with_Emotion_ICCV_2023_paper.html), University of Rochester. 29 | - [Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors](https://openaccess.thecvf.com/content/ICCV2023/html/Yu_Talking_Head_Generation_with_Probabilistic_Audio-to-Visual_Diffusion_Priors_ICCV_2023_paper.html), Xiaobing.AI, [ProjectPage](https://zxyin.github.io/TH-PAD/). 30 | - [Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation](https://openaccess.thecvf.com/content/ICCV2023/papers/Gan_Efficient_Emotional_Adaptation_for_Audio-Driven_Talking-Head_Generation_ICCV_2023_paper.pdf), Zhejiang University, [ProjectPage](https://yuangan.github.io/eat/), [Code](https://github.com/yuangan/EAT_code), [Blog](https://zhuanlan.zhihu.com/p/658569026). 31 | - [Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head Video Generation](https://openaccess.thecvf.com/content/ICCV2023/papers/Hong_Implicit_Identity_Representation_Conditioned_Memory_Compensation_Network_for_Talking_Head_ICCV_2023_paper.pdf), The Hong Kong University of Science and Technology, [ProjectPage](https://harlanhong.github.io/publications/mcnet.html), [Code](https://github.com/harlanhong/ICCV2023-MCNET). 32 | - [ToonTalker: Cross-Domain Face Reenactment](https://openaccess.thecvf.com/content/ICCV2023/papers/Gong_ToonTalker_Cross-Domain_Face_Reenactment_ICCV_2023_paper.pdf), Tsinghua University, [ProjectPage](https://opentalker.github.io/ToonTalker/), [Code](https://github.com/yuanygong/ToonTalker). 33 | - [Robust One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2](https://openaccess.thecvf.com/content/ICCV2023/papers/Oorloff_Robust_One-Shot_Face_Video_Re-enactment_using_Hybrid_Latent_Spaces_of_ICCV_2023_paper.pdf), University of Maryland, [ProjectPage](https://trevineoorloff.github.io/FaceVideoReenactment_HybridLatents.io/). 34 | - [EMMN: Emotional Motion Memory Network for Audio-driven Emotional Talking Face Generation](https://openaccess.thecvf.com/content/ICCV2023/papers/Tan_EMMN_Emotional_Motion_Memory_Network_for_Audio-driven_Emotional_Talking_Face_ICCV_2023_paper.pdf), Shanghai Jiao Tong University. 35 | - [EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation](https://openaccess.thecvf.com/content/ICCV2023/papers/Peng_EmoTalk_Speech-Driven_Emotional_Disentanglement_for_3D_Face_Animation_ICCV_2023_paper.pdf), Renmin University of China, [ProjectPage](https://ziqiaopeng.github.io/emotalk/), [Code](https://github.com/psyai-net/EmoTalk_release). 36 | - [HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and 37 | Retarget Faces](https://openaccess.thecvf.com/content/ICCV2023/papers/Bounareli_HyperReenact_One-Shot_Reenactment_via_Jointly_Learning_to_Refine_and_Retarget_ICCV_2023_paper.pdf), Kingston University London, [Code](https://github.com/StelaBou/HyperReenact). 
38 | - [MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions](https://openaccess.thecvf.com/content/ICCV2023/papers/Liu_MODA_Mapping-Once_Audio-driven_Portrait_Animation_with_Dual_Attentions_ICCV_2023_paper.pdf), International Digital Economy Academy (IDEA), [ProjectPage](https://liuyunfei.net/projects/iccv23-moda/), [Code](https://github.com/DreamtaleCore/MODA). 39 | - [SPACE: Speech-driven Portrait Animation with Controllable Expression](https://openaccess.thecvf.com/content/ICCV2023/papers/Gururani_SPACE_Speech-driven_Portrait_Animation_with_Controllable_Expression_ICCV_2023_paper.pdf), NVIDIA, [ProjectPage](https://research.nvidia.com/labs/dir/space/). 40 | ### 3D Talking-Face Synthesis (1) 41 | *keywords: efficiency*; 42 | 43 | - [Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis](https://openaccess.thecvf.com/content/ICCV2023/html/Li_Efficient_Region-Aware_Neural_Radiance_Fields_for_High-Fidelity_Talking_Portrait_Synthesis_ICCV_2023_paper.html), Beihang University, [Code](https://github.com/Fictionarry/ER-NeRF). 44 | 45 | ## Talking-Face Video Editing (1) 46 | 47 | *keywords: GAN inversion, StyleGAN*; 48 | 49 | - [RIGID: Recurrent GAN Inversion and Editing of Real Face Videos](https://openaccess.thecvf.com/content/ICCV2023/papers/Xu_RIGID_Recurrent_GAN_Inversion_and_Editing_of_Real_Face_Videos_ICCV_2023_paper.pdf), The University of Hong Kong, [ProjectPage](https://cnnlstm.github.io/RIGID/), [Code](https://github.com/cnnlstm/RIGID). 50 | 51 | ## Talking-Body Synthesis (4) 52 | 53 | *keywords: continual learning, lively, one-shot*; 54 | 55 | - [Continual Learning for Personalized Co-Speech Gesture Generation](https://openaccess.thecvf.com/content/ICCV2023/html/Ahuja_Continual_Learning_for_Personalized_Co-speech_Gesture_Generation_ICCV_2023_paper.html), CMU, [ProjectPage](https://chahuja.com/cdiffgan/), [Dataset](https://chahuja.com/pats/). 56 | - [LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation](https://openaccess.thecvf.com/content/ICCV2023/html/Zhi_LivelySpeaker_Towards_Semantic-Aware_Co-Speech_Gesture_Generation_ICCV_2023_paper.html), ShanghaiTech University, [Code](https://github.com/zyhbili/LivelySpeaker). 57 | - [DINAR: Diffusion Inpainting of Neural Textures for One-Shot Human Avatars](https://openaccess.thecvf.com/content/ICCV2023/html/Svitov_DINAR_Diffusion_Inpainting_of_Neural_Textures_for_One-Shot_Human_Avatars_ICCV_2023_paper.html), Samsung AI Center, [Code](https://github.com/SamsungLabs/DINAR). 58 | - [One-shot Implicit Animatable Avatars with Model-based Priors](https://openaccess.thecvf.com/content/ICCV2023/html/Huang_One-shot_Implicit_Animatable_Avatars_with_Model-based_Priors_ICCV_2023_paper.html), Zhejiang University, [Code](https://github.com/huangyangyi/ELICIT). 59 | -------------------------------------------------------------------------------- /benchmarks/readme.md: -------------------------------------------------------------------------------- 1 | # A Microbenchmark for Talking-Face Synthesis 2 | ### [**Dataset**](https://drive.google.com/drive/folders/1vBse3rgHd3JfTGNFXC-oUZs5DR9B5Mep?usp=sharing) | [**Website**](https://jason-cs18.github.io/awesome-avatar/benchmarks/) 3 | 4 | This repository contains the datasets and testing scripts for talking-face synthesis. 5 | 6 | > A microbenchmark serves as a valuable tool for researchers to conduct speedy evaluations of new algorithms. This repository can be easily customized and applied to diverse audio-visual talking-face datasets. 
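For reference, the benchmark's utility package can be called directly from a notebook or script (see `test.ipynb`). The sketch below assumes it is run from the `benchmarks/` directory; `transcript.txt` is an illustrative path, not a file shipped with this repository.

```python
# Minimal usage sketch, assuming the working directory is benchmarks/.
# "transcript.txt" is a placeholder path; point it at your own transcript file.
import utils  # re-exports count_sentences from utils/video_processing.py

transcript_path = "transcript.txt"
num_sentences = utils.count_sentences(transcript_path)  # splits the first line on "."
print(num_sentences)
```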
7 | 8 | ### Datasets 9 | In this benchmark, we collect 3 videos for English speakers and 3 videos for Chinese speakers. 10 | 11 | 14 | 15 | 16 | 17 | #### File Structure 18 | ``` 19 | ├── driving_audios 20 | | ├── [9.3M] may_english_audio.aac 21 | | ├── [3.3M] macron_english_trim_audio.aac 22 | | ├── [3.5M] obama1_english_audio.aac 23 | | ├── [780K] laoliang_chinese_50s_audio.mp3 24 | | ├── [4.3M] luoxiang_chinese_audio.mp3 25 | | ├── [8.9M] zuijiapaidang_chinese_audio.mp3 26 | ├── source_images 27 | | ├── [294K] may.png 28 | | ├── [202K] macron.png 29 | | ├── [213M] obama1.png 30 | | ├── [206K] zuijiapaidang.png 31 | | ├── [175K] luoxiang.png 32 | | ├── [204K] laoliang.png 33 | ├── reference_videos 34 | │ ├── [56M] obama1_english.mp4, 03:38.16, 25fps, 450x450, 46 sentences 35 | │ ├── [96M] may_english.mp4, 04:02.97, 25fps, 512x512, 35 sentences 36 | │ ├── [24M] macron_english_trim.mp4, 00:03:31.92, 25fps, 512x512, 49 sentences 37 | │ ├── [3.6M] laoliang_chinese_50s.mp4, 00:00:49.85, 30fps, 410x380, 40 sentences 38 | │ ├── [14M] luoxiang_chinese.mp4, 04:40.01, 25fps, 350x500, 32 sentences 39 | │ ├── [28M] zuijiapaidang_chinese.mp4, 09:41.98, 30fps, 460x450, 85 sentences 40 | ``` 41 | 42 |
| 53 | | 54 | | 55 | |
| 66 | | 67 | | 68 | |
PSNR: 32.287, SSIM: 0.951, FID: 18.993 |
92 | PSNR: 32.572, SSIM: 0.936, FID: 33.941 |
93 | PSNR: 35.737, SSIM: 0.969, FID: 6.121 |
94 |
| 99 | | 100 | | 101 | |
PSNR: 31.444, SSIM: 0.939, FID: 19.192 |
107 | PSNR: 34.367, SSIM: 0.971, FID: 23.631 |
108 | PSNR: 20.364, SSIM: 0.783, FID: 49.04 |
109 |
| 112 | | 113 | | 114 | |
PSNR: 20.587, SSIM: 0.754, FID: 24.051 |
123 | PSNR: 19.211, SSIM: 0.701, FID: 46.182 |
124 | PSNR: 18.729, SSIM: 0.763, FID: 98.982 |
125 |
| 130 | | 131 | | 132 | |
PSNR: 18.536, SSIM: 0.672, FID: 52.362 |
138 | PSNR: 14.363, SSIM: 0.598, FID: 104.221 |
139 | PSNR: 17.359, SSIM: 0.725, FID: 4.781 |
140 |
| 143 | | 144 | | 145 | |
| Pipeline | 156 |Sync↑ | 157 |PSNR↑ | 158 |SSIM↑ | 159 |FID↓ | 160 |Pipeline | 161 |Sync↑ | 162 |PSNR↑ | 163 |SSIM↓ | 164 |FID↓ | 165 |
| Wav2Lip | 168 |xxx | 169 |33.532 | 170 |0.952 | 171 |19.685 | 172 | 173 |Wav2Lip | 174 |xxx | 175 |28.725 | 176 |0.897 | 177 |30.621 | 178 | 179 |
| SadTalker | 182 |xxx | 183 |19.509 | 184 |0.739 | 185 |56.407 | 186 | 187 |SadTaler | 188 |xxx | 189 |16.753 | 190 |0.665 | 191 |68.120 | 192 | 193 |
| marcon_GeneFace.mp4 | 208 |macron_ER-NeRF.mp4 | 209 |
| zuijiapaidang_GeneFace.mp4 | 214 |zuijiapaidang_ER-NeRF.mp4 | 215 |
| Pipeline | 228 |Sync↑ | 229 |PSNR↑ | 230 |SSIM↓ | 231 |FID↓ | 232 |IS↑ | 233 |Pipeline | 234 |Sync↑ | 235 |PSNR↑ | 236 |SSIM↓ | 237 |FID↓ | 238 |IS↑ | 239 |
| GeneFace | 242 |xxx | 243 |xxx | 244 |xxx | 245 |xxx | 246 |xxx | 247 |GeneFace | 248 |xxx | 249 |xxx | 250 |xxx | 251 |xxx | 252 |xxx | 253 |
| ER-NeRF | 256 |xxx | 257 |xxx | 258 |xxx | 259 |xxx | 260 |xxx | 261 |ER-NeRF | 262 |xxx | 263 |xxx | 264 |xxx | 265 |xxx | 266 |xxx | 267 |
| Dataset name | 134 |Environment | 135 |Year | 136 |Resolution | 137 |Subject | 138 |Duration | 139 |Sentence | 140 |
| VoxCeleb1 | 143 |Wild | 144 |2017 | 145 |360p~720p | 146 |1251 | 147 |352 hours | 148 |100k | 149 |
| VoxCeleb2 | 152 |Wild | 153 |2018 | 154 |360p~720p | 155 |6112 | 156 |2442 hours | 157 |1128k | 158 |
| HDTF | 161 |Wild | 162 |2020 | 163 |720p~1080p | 164 |300+ | 165 |15.8 hours | 166 |167 | |
| LSP | 170 |Wild | 171 |2021 | 172 |720p~1080p | 173 |4 | 174 |18 minutes | 175 |100k | 176 |
| Dataset name | 182 |Environment | 183 |Year | 184 |Resolution | 185 |Subject | 186 |Duration | 187 |Sentence | 188 |
| CMLR | 191 |Lab | 192 |2019 | 193 |194 | | 11 | 195 |196 | | 102k | 197 |
| MAVD | 200 |Lab | 201 |2023 | 202 |1920x1080 | 203 |64 | 204 |24 hours | 205 |12k | 206 |
| CN-Celeb | 209 |Wild | 210 |2020 | 211 |212 | | 3000 | 213 |1200 hours | 214 |215 | |
| CN-Celeb-AV | 218 |Wild | 219 |2023 | 220 |221 | | 1136 | 222 |660 hours | 223 |224 | |
| CN-CVS | 227 |Wild | 228 |2023 | 229 |230 | | 2500+ | 231 |300+ hours | 232 |233 | |
| Metric name | 276 |Description | 277 |Code/Paper | 278 |
| LMD↓ | 281 |Mouth landmark distance | 282 |283 | |
| LMD↓ | 286 |Mouth landmark distance | 287 |288 | |
| MA↑ | 291 |The Insertion-over-Union (IoU) for the overlap between the predicted mouth area and the ground truth area | 292 |293 | |
| Sync↑ | 296 |The confidence score from SyncNet (Sync) | 297 |wav2lip | 298 |
| LSE-C↑ | 301 |Lip Sync Error - Confidence | 302 |wav2lip | 303 |
| LSE-D↓ | 306 |Lip Sync Error - Distance | 307 |wav2lip | 308 |
| Metric name | 314 |Description | 315 |Code/Paper | 316 |
| MAE↓ | 319 |Mean Absolute Error metric for image | 320 |mmagic | 321 |
| MSE↓ | 324 |Mean Squared Error metric for image | 325 |mmagic | 326 |
| PSNR↑ | 329 |Peak Signal-to-Noise Ratio | 330 |mmagic | 331 |
| SSIM↑ | 334 |Structural similarity for image | 335 |mmagic | 336 |
| FID↓ | 339 |Frchet Inception Distance | 340 |mmagic | 341 |
| IS↑ | 344 |Inception score | 345 |mmagic | 346 |
| NIQE↓ | 349 |Natural Image Quality Evaluator metric | 350 |mmagic | 351 |
| CSIM↑ | 354 |The cosine similarity of identity embedding | 355 |InsightFace | 356 |
| CPBD↑ | 359 |The cumulative probability blur detection | 360 |python-cpbd | 361 |
| Metric name | 367 |Description | 368 |Code/Paper | 369 |
| Diversity of head motions↑ | 372 |A standard deviation of the head motion feature embeddings extracted from the generated frames using Hopenet (Ruiz et al., 2018) is calculated | 373 |SadTalker | 374 |
| Beat Align Score↑ | 377 |The alignment of the audio and generated head motions is calculated in Bailando (Siyao et al., 2022) | 378 |SadTalker | 379 |