├── notes ├── moda.md ├── readme.md ├── example.md ├── StyleHEAT note.md └── sadtalker.md ├── advances ├── README.md ├── tutorials.md └── iccv23_papers.md ├── benchmarks ├── utils │ ├── __init__.py │ ├── __pycache__ │ │ ├── __init__.cpython-39.pyc │ │ └── video_processing.cpython-39.pyc │ └── video_processing.py ├── assets │ └── file_structure.png ├── test.ipynb └── readme.md ├── assets ├── StyleHEAT.png └── sadtalker.png ├── SUMMARY.md ├── memory.md └── README.md /notes/moda.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /advances/README.md: -------------------------------------------------------------------------------- 1 | # advances 2 | 3 | -------------------------------------------------------------------------------- /benchmarks/utils/__init__.py: -------------------------------------------------------------------------------- 1 | from .video_processing import count_sentences -------------------------------------------------------------------------------- /assets/StyleHEAT.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jason-cs18/awesome-avatar/HEAD/assets/StyleHEAT.png -------------------------------------------------------------------------------- /assets/sadtalker.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jason-cs18/awesome-avatar/HEAD/assets/sadtalker.png -------------------------------------------------------------------------------- /benchmarks/assets/file_structure.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jason-cs18/awesome-avatar/HEAD/benchmarks/assets/file_structure.png -------------------------------------------------------------------------------- /benchmarks/utils/__pycache__/__init__.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jason-cs18/awesome-avatar/HEAD/benchmarks/utils/__pycache__/__init__.cpython-39.pyc -------------------------------------------------------------------------------- /benchmarks/utils/__pycache__/video_processing.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jason-cs18/awesome-avatar/HEAD/benchmarks/utils/__pycache__/video_processing.cpython-39.pyc -------------------------------------------------------------------------------- /benchmarks/utils/video_processing.py: -------------------------------------------------------------------------------- 1 | def count_sentences(text_url): 2 | """ 3 | Counts the number of sentences in a given text. 4 | :param text: The text to count the sentences in. 5 | :return: The number of sentences in the text. 
6 | """ 7 | with open(text_url, 'r') as f: 8 | text = f.readlines() 9 | 10 | return len(text[0].split(".")) -------------------------------------------------------------------------------- /SUMMARY.md: -------------------------------------------------------------------------------- 1 | # Table of contents 2 | 3 | * [awesome-avatar](README.md) 4 | * [advances](advances/README.md) 5 | * [ICCV'23, Oct 4-6, 2023](advances/iccv23\_papers.md) 6 | * [Paper note](notes/readme.md) 7 | * [Title](notes/example.md) 8 | * [moda](notes/moda.md) 9 | * [SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking](notes/sadtalker.md) 10 | -------------------------------------------------------------------------------- /notes/readme.md: -------------------------------------------------------------------------------- 1 | # Paper note 2 | All notes follow the style of [DeeplearningAI](https://www.deeplearning.ai/the-batch/tag/research/). 3 | 4 | - [Note Template](https://github.com/Jason-cs18/awesome-avatar/blob/main/notes/example.md) 5 | - [SadTalker CVPR'23](https://github.com/Jason-cs18/awesome-avatar/blob/main/notes/sadtalker.md) 6 | - [MODA ICCV'23](https://github.com/Jason-cs18/awesome-avatar/blob/main/notes/moda.md) 7 | -------------------------------------------------------------------------------- /memory.md: -------------------------------------------------------------------------------- 1 | # Memory-Augmented Talking Face Synthesis 2 | - [arXiv 2020.10] [Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose](https://arxiv.org/pdf/2002.10137.pdf) | Tsinghua University 3 | - [AAAI'22 Oral] [SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory](https://arxiv.org/abs/2211.00924) | KAIST 4 | - [arXiv 2022.12] [Memories are One-to-Many Mapping Alleviators in Talking Face Generation](https://arxiv.org/pdf/2212.05005.pdf) | Shanghai Jiao Tong University 5 | - [ICCV'23] [EMMN: Emotional Motion Memory Network for Audio-driven Emotional Talking Face Generation](https://openaccess.thecvf.com/content/ICCV2023/papers/Tan_EMMN_Emotional_Motion_Memory_Network_for_Audio-driven_Emotional_Talking_Face_ICCV_2023_paper.pdf) | Shanghai Jiao Tong University 6 | - [arXiv 2023.05] [GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation](https://arxiv.org/abs/2305.00787) | Zhejiang University 7 | -------------------------------------------------------------------------------- /benchmarks/test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 7, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "49\n" 13 | ] 14 | } 15 | ], 16 | "source": [ 17 | "import utils\n", 18 | "\n", 19 | "sample_transcript_file = \"/home/jason/Downloads/transcript (3).txt\"\n", 20 | "\n", 21 | "print(utils.count_sentences(sample_transcript_file))" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [] 30 | } 31 | ], 32 | "metadata": { 33 | "kernelspec": { 34 | "display_name": "pytorch", 35 | "language": "python", 36 | "name": "python3" 37 | }, 38 | "language_info": { 39 | "codemirror_mode": { 40 | "name": "ipython", 41 | "version": 3 42 | }, 43 | "file_extension": ".py", 44 | "mimetype": "text/x-python", 45 | "name": "python", 46 | "nbconvert_exporter": "python", 47 
| "pygments_lexer": "ipython3", 48 | "version": "3.9.17" 49 | } 50 | }, 51 | "nbformat": 4, 52 | "nbformat_minor": 2 53 | } 54 | -------------------------------------------------------------------------------- /notes/example.md: -------------------------------------------------------------------------------- 1 | ## Title 2 | Use 1-2 sentences to highlight the key contributions of the paper. 3 | 4 | **what's new:** The authors (xxx and xxx from xxx) proposed xxx to address/achieve xxx. 5 | 6 | **key insights:** Previous works leverage xxx to achieve xxx, but they are limited by xxx. To overcome xxx, the authors designed xxx. 7 | 8 | **How it works:** The authors designed a three-stage pipeline named xxx to generate xxx. In the first, xxx-1. In the second, xxx-2. In the end, xxx-3. 9 | - xxx-1 is trained with xxx. Compared with existing works, xxx is better than xxx. This is because xxx. 10 | - xxx-2 is trained with xxx. Compared with existing works, xxx is better than xxx. This is because xxx. 11 | - xxx-3 is trained with xxx. Compared with existing works, xxx is better than xxx. This is because xxx. 12 | 13 | **Results:** The authors evaluated xxx on xxx. Compared with xxx, xxx is better on xxx. But it is worse than xxx on xxx. This is because xxx. 14 | 15 | **Why it matters:** This work reveals that xxx but xxx. Such insights deepen our understanding of xxx and can help practitioners explain their outputs. 16 | 17 | **We're thinking:** xxx-1 and xxx-2 may be useful for other pipelines because xxx. 18 | 19 | -------------------------------------------------------------------------------- /notes/StyleHEAT note.md: -------------------------------------------------------------------------------- 1 | # StyleHEAT 2 | 3 | ## StyleHEAT 4 | 5 | Proposes a unified framework based on a pre-trained StyleGAN for one-shot talking face generation, which uses audio or video to drive a source image. 6 | 7 | **what’s new:** 8 | 9 | Fei Yin from Tsinghua University and researchers from Tencent AI Lab have released a new generation pipeline named StyleHEAT to synthesize diverse and stylized talking-face videos. 10 | 11 | **key insights:** 12 | 13 | Previous work was unable to generate high-resolution talking videos due to dataset, efficiency, and other limitations. To overcome the low video resolution, the authors verified the spatial properties of StyleGAN's feature space and built the generation framework on top of it, achieving high-resolution talking-face video generation that can further be edited by facial attributes. 14 | 15 | **How it works:** 16 | 17 | The authors designed a Video-Driven Motion Generator, an Audio-Driven Motion Generator, and a Feature Calibration module to build the unified framework. 18 | 19 | - First, it obtains the style codes and feature maps of the source image via the encoder of GAN inversion. 20 | - Second, the driving video or audio, together with the source image, is used to predict motion fields by the corresponding **motion generator**. The motion generator outputs the desired flow fields for feature warping. 21 | - The Driving-Video Motion Generator uses 3DMM parameters as the motion representation, and its network is based on a U-Net. 22 | - The Driving-Audio Motion Generator uses the Mel-spectrogram and an MLP to squeeze the temporal dimension; the audio features are injected into the network via AdaIN. 23 | - Then, the selected feature map is warped by the motion fields, followed by the **calibration network** for rectifying feature distortions (a minimal warping sketch follows this list). 24 | - The refined feature map is then fed into StyleGAN for the final face generation.
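The warping step above can be prototyped independently of the paper: given a feature map from GAN inversion and a flow field from a motion generator, the feature map is resampled along the flow before calibration. Below is a minimal PyTorch sketch of such flow-based feature warping; the tensor names, shapes, and pixel-offset convention are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of warping an encoder feature map with a
# predicted flow field before it is passed to a calibration network and StyleGAN.
import torch
import torch.nn.functional as F

def warp_feature(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a feature map with a dense flow field.

    feat: (N, C, H, W) feature map from GAN inversion.
    flow: (N, 2, H, W) per-pixel (x, y) offsets in pixels from a motion generator.
    """
    n, _, h, w = feat.shape
    # Base sampling grid in normalized [-1, 1] coordinates, ordered as (x, y).
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=feat.device),
        torch.linspace(-1, 1, w, device=feat.device),
        indexing="ij",
    )
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)  # (N, H, W, 2)
    # Convert pixel offsets to normalized offsets and shift the base grid.
    norm_flow = torch.stack(
        (2.0 * flow[:, 0] / max(w - 1, 1), 2.0 * flow[:, 1] / max(h - 1, 1)), dim=-1
    )  # (N, H, W, 2)
    return F.grid_sample(feat, base_grid + norm_flow, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Example: warp a 512-channel 64x64 feature map with a random flow field.
warped = warp_feature(torch.randn(1, 512, 64, 64), torch.randn(1, 2, 64, 64))
```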
25 | 26 | ![Style](https://github.com/SuperGoodGame/awesome-avatar/blob/main/assets/StyleHEAT.png) 27 | 28 | **Results:** 29 | 30 | The authors train the two motion generators on the VoxCeleb dataset, while jointly training the whole framework on the HDTF dataset. The model outputs a 1024×1024 image generated by the pre-trained StyleGAN. The authors compared the generated images with other works, covering both same-identity and cross-identity reenactment. Compared with other methods qualitatively and quantitatively, this method generates images that are more natural and of higher resolution. In terms of lip sync, it offers more detail and higher resolution than Wav2Lip. 31 | 32 | **Why it matters:** 33 | 34 | This work proposes a unified framework based on a pre-trained StyleGAN for one-shot high-quality talking face generation. Such insights provide us with a new method and idea to improve the generation resolution. 35 | 36 | **We’re thinking:** 37 | 38 | We could generate the final video through GAN inversion and StyleGAN inference, replacing the FaceRender stage in SadTalker. 39 | -------------------------------------------------------------------------------- /advances/tutorials.md: -------------------------------------------------------------------------------- 1 | # Courses, Talks and Tutorials 2 | ## Courses 3 | ### Learning-Based Image Synthesis 4 | - Basic info: CMU 16-726, Spring 2023; 5 | - Instructor: [Jun-Yan Zhu](https://www.cs.cmu.edu/~junyanz/); 6 | - Website: https://learning-image-synthesis.github.io/sp23/ 7 | 8 | ## Tutorials 9 | ### Video Synthesis: Early Days and New Developments 10 | - Conference: ECCV'22 (Oct 24, 2022) 11 | - Organizers: [Sergey Tulyakov @ Snap Research](http://www.stulyakov.com/), [Jian Ren @ Snap Research](https://alanspike.github.io/), [Stéphane Lathuilière @ Telecom Paris](https://stelat.eu/), and [Aliaksandr Siarohin @ Snap Research](https://aliaksandrsiarohin.github.io/aliaksandr-siarohin-website/); 12 | - Website: https://snap-research.github.io/video-synthesis-tutorial/; 13 | 14 | ### Efficient Neural Networks: From Algorithm Design to Practical Mobile Deployments 15 | - Conference: CVPR'23 (June 18, 2023); 16 | - Organizers: [Jian Ren @ Snap Research](https://alanspike.github.io/), [Sergey Tulyakov @ Snap Research](http://www.stulyakov.com/), and [Eric Hu @ Snap](https://www.linkedin.com/in/erichuju/); 17 | - Website: https://snap-research.github.io/efficient-nn-tutorial/; 18 | 19 | ### Full-Stack, GPU-based Acceleration of Deep Learning 20 | - Conference: CVPR'23 (June 18, 2023); 21 | - Organizers: [Maying Shen @ NVIDIA](https://mayings.github.io/), [Jason Clemons @ NVIDIA Research](https://scholar.google.com/citations?user=J_1GGJsAAAAJ&hl=zh-CN), [Hongxu (Danny) Yin @ NVIDIA Research](https://hongxu-yin.github.io/), and [Pavlo Molchanov @ NVIDIA Research](https://www.pmolchanov.com/); 22 | - Website: https://nvlabs.github.io/EfficientDL/; 23 | 24 | ### Denoising Diffusion Models: A Generative Learning Big Bang 25 | - Conference: CVPR'23 (June 18, 2023); 26 | - Organizers: [Jiaming Song @ NVIDIA Research](https://tsong.me/), [Chenlin Meng @ Stanford](https://cs.stanford.edu/~chenlin/), and [Arash Vahdat @ NVIDIA Research](http://latentspace.cc/); 27 | - Website: https://cvpr2023-tutorial-diffusion-models.github.io/; 28 | 29 | ### Prompting in Vision 30 | - Conference: CVPR'23 (June 18, 2023); 31 | - Organizers: [Kaiyang Zhou @ NTU](https://kaiyangzhou.github.io/), [Ziwei Liu @ NTU](https://liuziwei7.github.io/), [Phillip
Isola @ MIT](http://web.mit.edu/phillipi/), [Hyojin Bahng @ MIT](), [Ludwig Schmidt @ UW](https://people.csail.mit.edu/ludwigs/), [Sarah Pratt @ UW](https://sarahpratt.github.io/), and [Denny Zhou @ Google Research](https://dennyzhou.github.io/) 32 | - Website: https://prompting-in-vision.github.io/; 33 | 34 | ### All Things ViTs: Understanding and Interpreting Attention in Vision 35 | - Conference: CVPR'23 (June 18, 2023); 36 | - Organizers: [Hila Chefer @ Tel Aviv University](https://hila-chefer.github.io/) and [Sayak Paul @ Google Research & Hugging Face](https://sayak.dev/); 37 | - Website: https://all-things-vits.github.io/atv/; 38 | 39 | ### Vision Transformer: More is different 40 | - Conference: CVPR'23 (June 18, 2023); 41 | - Organizers: [Dacheng Tao @ University of Sydney](https://scholar.google.com/citations?user=RwlJNLcAAAAJ&hl=en), [Qiming Zhang @ University of Sydney](https://scholar.google.com/citations?user=f8rAZ7MAAAAJ&hl=zh-CN), [Yufei Xu](https://scholar.google.com/citations?user=hlYWxX8AAAAJ&hl=zh-CN), and [Jing Zhang @ Renmin University of China](https://xiaojingzi.github.io/); 42 | - Website: https://cvpr2023.thecvf.com/virtual/2023/tutorial/18572; 43 | 44 | ## Talks 45 | 1. [Interpreting Deep Generative Models 46 | for Interactive AI Content Creation, Bolei Zhou, CUHK.](https://www.youtube.com/watch?v=PtRU2B6Iml4) -------------------------------------------------------------------------------- /notes/sadtalker.md: -------------------------------------------------------------------------------- 1 | ## SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation 2 | A VAE-based pose network is integrated into a new talking-face generation pipeline, SadTalker, to generate diverse head motions in talking-face videos. 3 | 4 | ![SadTalker overview](https://github.com/Jason-cs18/awesome-avatar/blob/main/assets/sadtalker.png "SadTalker overview") 5 | 6 | **What's new:** Wenxuan Zhang from Xi'an Jiaotong University and researchers from Tencent AI Lab have released a new generation pipeline named SadTalker to synthesize diverse and stylized talking-face videos. 7 | 8 | **Key insights:** Facial landmarks and keypoints were previously used as the intermediate facial representation in prior works, but they were found to be difficult to disentangle into expressions and head movements. To address this issue, SadTalker leveraged an explicit 3D face model to decouple the representations of expression and head motion. As a result, they designed ExpNet and PoseVAE to learn audio-to-expression and audio-to-pose mappings, respectively. 9 | 10 | 11 | 12 | **How it works:** As shown in Figure 2, the authors designed a three-stage inference pipeline to synthesize stylized talking-face videos. 13 | - They first leveraged a recent single-image deep 3D reconstruction method to extract the 3D face model from the target image, which consists of identity coefficients, expression coefficients and head pose (rotation and translation). 14 | - Secondly, they estimated the pose and expression coefficients using PoseVAE and ExpNet, respectively. After that, they obtained a motion flow in the 3D facial space. During training, the authors used reconstruction loss and distillation loss to motivate ExpNet to learn an accurate mapping for the entire facial motion and the lip motion, respectively. Similarly, reconstruction loss and KL-divergence loss were used to motivate PoseVAE to learn accurate and diverse head motions, respectively.
15 | - In the third stage, they leveraged a modified face vid2vid model to render a realistic face from the estimated 3D facial motion. In training, they first trained face vid2vid in a self-supervised fashion and then fine-tuned the customized mappingNet with a reconstruction objective. 16 | 17 | 18 | *Note: Training details are introduced in [the supplementary material](https://openaccess.thecvf.com/content/CVPR2023/supplemental/Zhang_SadTalker_Learning_Realistic_CVPR_2023_supplemental.pdf).* 19 | 20 | **Results:** The authors trained SadTalker on VoxCeleb1 with 8 NVIDIA A100 GPUs and tested it on the HDTF dataset. Compared with other competitors, it generated better head motion and a more realistic face, but its lip-sync is worse than that of other methods. 21 | 22 | 23 | 24 | **Why it matters:** This study demonstrated that an explicit 3D face model is a more accurate intermediate representation of the facial characteristics than those used in previous methods. Additionally, the use of VAE models for estimating head pose from audio was shown to improve the naturalness of the synthesized videos. These findings have the potential to enhance the quality of talking-face pipelines for machine learning researchers. 25 | 26 | 27 | 28 | **We're thinking:** Instead of using Wav2Lip as a teacher model in ExpNet training, can we use Wav2Lip as ExpNet directly? The 3D-aware face renderer should have similar capabilities to face vid2vid, allowing the renderer to be fine-tuned with 10~30s video clips of a target person to improve lip-sync results. 29 | 30 | *Note: Few-shot learning is widely studied in deep generation models. Details can be found in [fs-vid2vid, NeurIPS'19, NVIDIA](https://nvlabs.github.io/few-shot-vid2vid/) and [face-few-shot, ICCV'19, Samsung AI](https://openaccess.thecvf.com/content_ICCV_2019/papers/Zakharov_Few-Shot_Adversarial_Learning_of_Realistic_Neural_Talking_Head_Models_ICCV_2019_paper.pdf).* -------------------------------------------------------------------------------- /advances/iccv23_papers.md: -------------------------------------------------------------------------------- 1 | # ICCV'23, Oct 4-6, 2023 2 | On this page, we provide a short review of recent advances (21 papers) on digital humans, including 3D face reconstruction, talking-face synthesis, talking-body synthesis, and talking-face video editing. 3 | 4 | ## Workshops and Tutorials 5 | 1. To NeRF or not to NeRF: 6 | A View Synthesis Challenge for Human Heads https://sites.google.com/view/vschh/home 7 | 2. The 11th IEEE International Workshop on Analysis and Modeling of Faces and Gestures https://web.northeastern.edu/smilelab/amfg2023/ 8 | ## 3D Face Reconstruction (3) 9 | 10 | ![](https://project-hiface.github.io/img/detail.gif) 11 | 12 | 13 | 14 | - [Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation](https://openaccess.thecvf.com/content/ICCV2023/papers/He_Speech4Mesh_Speech-Assisted_Monocular_3D_Facial_Reconstruction_for_Speech-Driven_3D_Facial_ICCV_2023_paper.pdf), University of Science and Technology of China, [FLAME-Universe (3DMM alternative)](https://github.com/TimoBolkart/FLAME-Universe). 15 | - [HiFace: High-Fidelity 3D Face Reconstruction by 16 | Learning Static and Dynamic Details](https://openaccess.thecvf.com/content/ICCV2023/papers/Chai_HiFace_High-Fidelity_3D_Face_Reconstruction_by_Learning_Static_and_Dynamic_ICCV_2023_paper.pdf), National University of Singapore, [ProjectPage](https://project-hiface.github.io/).
17 | - [ASM: Adaptive Skinning Model for High-Quality 3D Face Modeling](https://openaccess.thecvf.com/content/ICCV2023/papers/Yang_ASM_Adaptive_Skinning_Model_for_High-Quality_3D_Face_Modeling_ICCV_2023_paper.pdf), Tencent AI Lab. 18 | ## Talking-Face Synthesis 19 | 20 | ### Metrics and Benchmarks (1) 21 | - [On the Audio-visual Synchronization for Lip-to-Speech Synthesis](https://openaccess.thecvf.com/content/ICCV2023/papers/Niu_On_the_Audio-visual_Synchronization_for_Lip-to-Speech_Synthesis_ICCV_2023_paper.pdf), The Hong Kong University of Science and Technology. 22 | 23 | ### 2D Talking-Face Synthesis (11) 24 | *keywords: emotion, diffusion priors, memory network, StyleGAN2*; 25 | 26 | ![](https://github.com/StelaBou/HyperReenact/raw/master/images/architecture.png) 27 | 28 | - [Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation](https://openaccess.thecvf.com/content/ICCV2023/html/Song_Emotional_Listener_Portrait_Neural_Listener_Head_Generation_with_Emotion_ICCV_2023_paper.html), University of Rochester. 29 | - [Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors](https://openaccess.thecvf.com/content/ICCV2023/html/Yu_Talking_Head_Generation_with_Probabilistic_Audio-to-Visual_Diffusion_Priors_ICCV_2023_paper.html), Xiaobing.AI, [ProjectPage](https://zxyin.github.io/TH-PAD/). 30 | - [Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation](https://openaccess.thecvf.com/content/ICCV2023/papers/Gan_Efficient_Emotional_Adaptation_for_Audio-Driven_Talking-Head_Generation_ICCV_2023_paper.pdf), Zhejiang University, [ProjectPage](https://yuangan.github.io/eat/), [Code](https://github.com/yuangan/EAT_code), [Blog](https://zhuanlan.zhihu.com/p/658569026). 31 | - [Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head Video Generation](https://openaccess.thecvf.com/content/ICCV2023/papers/Hong_Implicit_Identity_Representation_Conditioned_Memory_Compensation_Network_for_Talking_Head_ICCV_2023_paper.pdf), The Hong Kong University of Science and Technology, [ProjectPage](https://harlanhong.github.io/publications/mcnet.html), [Code](https://github.com/harlanhong/ICCV2023-MCNET). 32 | - [ToonTalker: Cross-Domain Face Reenactment](https://openaccess.thecvf.com/content/ICCV2023/papers/Gong_ToonTalker_Cross-Domain_Face_Reenactment_ICCV_2023_paper.pdf), Tsinghua University, [ProjectPage](https://opentalker.github.io/ToonTalker/), [Code](https://github.com/yuanygong/ToonTalker). 33 | - [Robust One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2](https://openaccess.thecvf.com/content/ICCV2023/papers/Oorloff_Robust_One-Shot_Face_Video_Re-enactment_using_Hybrid_Latent_Spaces_of_ICCV_2023_paper.pdf), University of Maryland, [ProjectPage](https://trevineoorloff.github.io/FaceVideoReenactment_HybridLatents.io/). 34 | - [EMMN: Emotional Motion Memory Network for Audio-driven Emotional Talking Face Generation](https://openaccess.thecvf.com/content/ICCV2023/papers/Tan_EMMN_Emotional_Motion_Memory_Network_for_Audio-driven_Emotional_Talking_Face_ICCV_2023_paper.pdf), Shanghai Jiao Tong University. 35 | - [EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation](https://openaccess.thecvf.com/content/ICCV2023/papers/Peng_EmoTalk_Speech-Driven_Emotional_Disentanglement_for_3D_Face_Animation_ICCV_2023_paper.pdf), Renmin University of China, [ProjectPage](https://ziqiaopeng.github.io/emotalk/), [Code](https://github.com/psyai-net/EmoTalk_release).
36 | - [HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and 37 | Retarget Faces](https://openaccess.thecvf.com/content/ICCV2023/papers/Bounareli_HyperReenact_One-Shot_Reenactment_via_Jointly_Learning_to_Refine_and_Retarget_ICCV_2023_paper.pdf), Kingston University London, [Code](https://github.com/StelaBou/HyperReenact). 38 | - [MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions](https://openaccess.thecvf.com/content/ICCV2023/papers/Liu_MODA_Mapping-Once_Audio-driven_Portrait_Animation_with_Dual_Attentions_ICCV_2023_paper.pdf), International Digital Economy Academy (IDEA), [ProjectPage](https://liuyunfei.net/projects/iccv23-moda/), [Code](https://github.com/DreamtaleCore/MODA). 39 | - [SPACE: Speech-driven Portrait Animation with Controllable Expression](https://openaccess.thecvf.com/content/ICCV2023/papers/Gururani_SPACE_Speech-driven_Portrait_Animation_with_Controllable_Expression_ICCV_2023_paper.pdf), NVIDIA, [ProjectPage](https://research.nvidia.com/labs/dir/space/). 40 | ### 3D Talking-Face Synthesis (1) 41 | *keywords: efficiency*; 42 | 43 | - [Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis](https://openaccess.thecvf.com/content/ICCV2023/html/Li_Efficient_Region-Aware_Neural_Radiance_Fields_for_High-Fidelity_Talking_Portrait_Synthesis_ICCV_2023_paper.html), Beihang University, [Code](https://github.com/Fictionarry/ER-NeRF). 44 | 45 | ## Talking-Face Video Editing (1) 46 | 47 | *keywords: GAN inversion, StyleGAN*; 48 | 49 | - [RIGID: Recurrent GAN Inversion and Editing of Real Face Videos](https://openaccess.thecvf.com/content/ICCV2023/papers/Xu_RIGID_Recurrent_GAN_Inversion_and_Editing_of_Real_Face_Videos_ICCV_2023_paper.pdf), The University of Hong Kong, [ProjectPage](https://cnnlstm.github.io/RIGID/), [Code](https://github.com/cnnlstm/RIGID). 50 | 51 | ## Talking-Body Synthesis (4) 52 | 53 | *keywords: continual learning, lively, one-shot*; 54 | 55 | - [Continual Learning for Personalized Co-Speech Gesture Generation](https://openaccess.thecvf.com/content/ICCV2023/html/Ahuja_Continual_Learning_for_Personalized_Co-speech_Gesture_Generation_ICCV_2023_paper.html), CMU, [ProjectPage](https://chahuja.com/cdiffgan/), [Dataset](https://chahuja.com/pats/).​ 56 | - [LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation](https://openaccess.thecvf.com/content/ICCV2023/html/Zhi_LivelySpeaker_Towards_Semantic-Aware_Co-Speech_Gesture_Generation_ICCV_2023_paper.html), ShanghaiTech University, [Code](https://github.com/zyhbili/LivelySpeaker). ​ 57 | - [DINAR: Diffusion Inpainting of Neural Textures for One-Shot Human Avatars](https://openaccess.thecvf.com/content/ICCV2023/html/Svitov_DINAR_Diffusion_Inpainting_of_Neural_Textures_for_One-Shot_Human_Avatars_ICCV_2023_paper.html), Samsung AI Center, [Code](https://github.com/SamsungLabs/DINAR).​ 58 | - [One-shot Implicit Animatable Avatars with Model-based Priors](https://openaccess.thecvf.com/content/ICCV2023/html/Huang_One-shot_Implicit_Animatable_Avatars_with_Model-based_Priors_ICCV_2023_paper.html), Zhejiang University, [Code](https://github.com/huangyangyi/ELICIT). 
59 | -------------------------------------------------------------------------------- /benchmarks/readme.md: -------------------------------------------------------------------------------- 1 | # A Microbenchmark for Talking-Face Synthesis 2 | ### [**Dataset**](https://drive.google.com/drive/folders/1vBse3rgHd3JfTGNFXC-oUZs5DR9B5Mep?usp=sharing) | [**Website**](https://jason-cs18.github.io/awesome-avatar/benchmarks/) 3 | 4 | This repository contains the datasets and testing scripts for talking-face synthesis. 5 | 6 | > A microbenchmark serves as a valuable tool for researchers to conduct speedy evaluations of new algorithms. This repository can be easily customized and applied to diverse audio-visual talking-face datasets. 7 | 8 | ### Datasets 9 | In this benchmark, we collect 3 videos for English speakers and 3 videos for Chinese speakers. 10 | 11 | 14 | 15 | 16 | 17 | #### File Structure 18 | ``` 19 | ├── driving_audios 20 | | ├── [9.3M] may_english_audio.aac 21 | | ├── [3.3M] macron_english_trim_audio.aac 22 | | ├── [3.5M] obama1_english_audio.aac 23 | | ├── [780K] laoliang_chinese_50s_audio.mp3 24 | | ├── [4.3M] luoxiang_chinese_audio.mp3 25 | | ├── [8.9M] zuijiapaidang_chinese_audio.mp3 26 | ├── source_images 27 | | ├── [294K] may.png 28 | | ├── [202K] macron.png 29 | | ├── [213M] obama1.png 30 | | ├── [206K] zuijiapaidang.png 31 | | ├── [175K] luoxiang.png 32 | | ├── [204K] laoliang.png 33 | ├── reference_videos 34 | │ ├── [56M] obama1_english.mp4, 03:38.16, 25fps, 450x450, 46 sentences 35 | │ ├── [96M] may_english.mp4, 04:02.97, 25fps, 512x512, 35 sentences 36 | │ ├── [24M] macron_english_trim.mp4, 00:03:31.92, 25fps, 512x512, 49 sentences 37 | │ ├── [3.6M] laoliang_chinese_50s.mp4, 00:00:49.85, 30fps, 410x380, 40 sentences 38 | │ ├── [14M] luoxiang_chinese.mp4, 04:40.01, 25fps, 350x500, 32 sentences 39 | │ ├── [28M] zuijiapaidang_chinese.mp4, 09:41.98, 30fps, 460x450, 85 sentences 40 | ``` 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 |
English Speakers
obama1_english.mp4
may_english.mp4
macron_english.mp4
Chinese Speakers
laoliang_chinese.mp4
luoxiang_chinese.mp4
zuijiapaidang_chinese.mp4
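The per-video sentence counts listed in the file structure above can be reproduced from speech transcripts with the bundled `utils.count_sentences` helper (see `benchmarks/test.ipynb`). A minimal usage sketch is shown below; the transcript path is a hypothetical placeholder, and transcripts can be produced with, e.g., Whisper Web (linked under External Links).

```python
# Minimal usage sketch of the bundled sentence counter (benchmarks/utils/video_processing.py).
# The transcript path is a hypothetical placeholder; the helper naively splits the first
# line of the transcript file on "." to approximate the number of sentences.
from utils import count_sentences

transcript_path = "transcripts/macron_english_trim.txt"  # hypothetical path
print(count_sentences(transcript_path))  # approximate sentence count for this clip
```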
70 | 71 | ### Benchmark 72 | To measure the performance of Wav2Lip and SadTalker, we run them on all videos and testing with the following metrics: 73 | - **Sync↑**: The confidence score from SyncNet (lip-sync); 74 | - **PSNR↑**: Peak signal-to-noise ratio (identity-preserving); 75 | - **SSIM↑**: Structural similarity for image (identity-preserving); 76 | - **FID↓**: Frchet inception distance (image quality); 77 | 78 | ### Implementation (off-the-shelf tools) 79 | 1. Sync: [syncnet_python](https://github.com/joonson/syncnet_python) ![Github stars](https://img.shields.io/github/stars/joonson/syncnet_python.svg) 80 | 2. PSNR, SSIM: [ffmpeg-quality-metrics](https://github.com/slhck/ffmpeg-quality-metrics) ![Github stars](https://img.shields.io/github/stars/slhck/ffmpeg-quality-metrics.svg) 81 | 3. FID, ~~IS~~: [IQA-PyTorch](https://github.com/chaofengc/IQA-PyTorch) ![Github stars](https://img.shields.io/github/stars/chaofengc/IQA-PyTorch.svg) 82 | 83 | 84 | ### Qualitative Results for One-shot Pipelines 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 |
English Speakers
obama1_Wav2Lip.mp4
PSNR: 32.287, SSIM: 0.951, FID: 18.993
may_Wav2Lip.mp4
PSNR: 32.572, SSIM: 0.936, FID: 33.941
macron_Wav2Lip.mp4
PSNR: 35.737, SSIM: 0.969, FID: 6.121
Chinese Speakers
laoliang_Wav2Lip.mp4
PSNR: 31.444, SSIM: 0.939, FID: 19.192
luoxiang_Wav2Lip.mp4
PSNR: 34.367, SSIM: 0.971, FID: 23.631
zuijiapaidang_Wav2Lip.mp4
PSNR: 20.364, SSIM: 0.783, FID: 49.04
117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 |
English Speakers
obama1_SadTalker.mp4
PSNR: 20.587, SSIM: 0.754, FID: 24.051
may_SadTalker.mp4
PSNR: 19.211, SSIM: 0.701, FID: 46.182
macron_SadTalker.mp4
PSNR: 18.729, SSIM: 0.763, FID: 98.982
Chinese Speakers
laoliang_SadTalker.mp4
PSNR: 18.536, SSIM: 0.672, FID: 52.362
luoxiang_SadTalker.mp4
PSNR: 14.363, SSIM: 0.598, FID: 104.221
zuijiapaidang_SadTalker.mp4
PSNR: 17.359, SSIM: 0.725, FID: 4.781
148 | 149 | ### Quantitative Results for One-shot Pipelines 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 |
English Speakers
Chinese Speakers
PipelineSync↑PSNR↑SSIM↑FID↓PipelineSync↑PSNR↑SSIM↑FID↓
Wav2Lipxxx33.5320.95219.685Wav2Lipxxx28.7250.89730.621
SadTalkerxxx19.5090.73956.407SadTalkerxxx16.7530.66568.120
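The PSNR/SSIM values in the tables above come from ffmpeg-quality-metrics; as a per-frame sanity check they can also be recomputed with scikit-image. The sketch below is only such a cross-check under assumed inputs (frames already extracted and size-matched), not the benchmark's own script; video-level numbers are then the average over all frames.

```python
# A minimal per-frame PSNR/SSIM cross-check with OpenCV + scikit-image.
# Assumes frames were already extracted (e.g., with ffmpeg) and that the generated
# frame has the same resolution as the reference frame.
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_psnr_ssim(generated_path, reference_path):
    gen = cv2.imread(generated_path)
    ref = cv2.imread(reference_path)
    assert gen is not None and ref is not None and gen.shape == ref.shape
    psnr = peak_signal_noise_ratio(ref, gen, data_range=255)
    ssim = structural_similarity(ref, gen, channel_axis=2, data_range=255)
    return psnr, ssim
```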
196 | 197 | 198 | *Because NeRF based renderers (GeneFace and ER-NeRF) are person-dependent, we train them on the first 3 minutes of marcon and zuijiapaidang respectively.* 199 | 200 | 201 | ### Qualitative Results for Few-shot Pipelines 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 |
English Speakers
marcon_GeneFace.mp4macron_ER-NeRF.mp4
Chinese Speakers
zuijiapaidang_GeneFace.mp4zuijiapaidang_ER-NeRF.mp4
218 | 219 | 220 | ### Quantitative Results for Few-shot Pipelines 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 |
macron (English)
zuijiapaidang (Chinese)
PipelineSync↑PSNR↑SSIM↑FID↓IS↑PipelineSync↑PSNR↑SSIM↑FID↓IS↑
GeneFacexxxxxxxxxxxxxxxGeneFacexxxxxxxxxxxxxxx
ER-NeRFxxxxxxxxxxxxxxxER-NeRFxxxxxxxxxxxxxxx
269 | 270 | 271 | ## External Links 272 | 1. [Extract Frames using FFmpeg: A Comprehensive Guide](https://ottverse.com/extract-frames-using-ffmpeg-a-comprehensive-guide/) 273 | 2. [Whisper Web: ML-powered speech recognition directly in your browser](https://huggingface.co/spaces/Xenova/whisper-web) 274 | 3. [moviepy.video.fx.all.crop](https://zulko.github.io/moviepy/ref/videofx/moviepy.video.fx.all.crop.html) 275 | 4. [Trim Video: Trim or cut video of any format](https://online-video-cutter.com/) -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # awesome-avatar 2 | This is a repository for organizing papers, codes and other resources related to the topic of Avatar (talking-face and talking-body). 3 | 4 | #### 🔆 This project is still on-going, pull requests are welcomed!! 5 | If you have any suggestions (missing papers, new papers, key researchers or typos), please feel free to edit and pull a request. 6 | 7 | #### News 8 | - **2024.09.07**: add ASR and TTS tool 9 | - **2024.08.24**: add backgrounds for image/video generations 10 | - **2024.08.24**: re-organize paper list with table formating 11 | - **2024.08.24**: add works about full-body avatar synthesis 12 | 13 | 14 | #### TO DO LIST 15 | 16 | - [x] Main paper list 17 | - [x] Researchers list 18 | - [x] Toolbox for avatar 19 | - [x] Add paper link 20 | - [ ] Add [paper notes](https://github.com/Jason-cs18/awesome-avatar/tree/main/notes) 21 | - [x] Add codes if have 22 | - [x] Add project page if have 23 | - [x] Datasets and metrics 24 | - [x] Related links 25 | 26 | ## Researchers and labs 27 | 1. [NVIDIA Research](https://www.nvidia.com/en-us/research/) 28 | - Neural rendering models for human generation: [vid2vid NeurIPS'18](https://tcwang0509.github.io/vid2vid/), [fs-vid2vid NeurIPS'19](https://nvlabs.github.io/few-shot-vid2vid/), [EG3D CVPR'22](https://github.com/NVlabs/eg3d); 29 | - Talking-face synthesis: [face-vid2vid CVPR'21](https://nvlabs.github.io/face-vid2vid/), [Implicit NeurIPS'22](https://research.nvidia.com/labs/dir/implicit_warping/), [SPACE ICCV'23](https://research.nvidia.com/labs/dir/space/), [One-shot 30 | Neural Head Avatar arXiv'23](https://research.nvidia.com/labs/lpr/one-shot-avatar/); 31 | - Talking-body synthesis: [DreamPose ICCV'23](https://grail.cs.washington.edu/projects/dreampose/); 32 | - Face enhancement (relighting, restoration, etc): [Lumos SIGGRAPH Asia 2022](https://research.nvidia.com/labs/dir/lumos/), [RANA ICCV'23](https://nvlabs.github.io/RANA/); 33 | - Authorized use of synthetic videos: [Avatar Fingerprinting arXiv'23](https://research.nvidia.com/labs/nxp/avatar-fingerprinting/); 34 | 2. [Aliaksandr Siarohin @ Snap Research](https://research.snap.com/team/team-member.html#aliaksandr-siarohin) 35 | - Neural rendering models for human generation (focus on flow-based generative models): [Unsupervised-Volumetric-Animation CVPR'23](https://github.com/snap-research/unsupervised-volumetric-animation), [3DAvatarGAN CVPR'23](https://arxiv.org/abs/2301.02700), [3D-SGAN ECCV'22](https://arxiv.org/abs/2112.01422), [Articulated-Animation CVPR'21](https://arxiv.org/abs/2104.11280), [Monkey-Net CVPR'19](https://arxiv.org/abs/1812.08861), [FOMM NeurIPS'19](http://papers.nips.cc/paper/8935-first-order-motion-model-for-image-animation); 36 | 3. 
[Ziwei Liu @ Nanyang Technological University](https://liuziwei7.github.io/index.html) 37 | - Talking-face synthesis: [StyleSync CVPR'23](https://hangz-nju-cuhk.github.io/projects/StyleSync), [AV-CAT SIGGRAPH Asia 2022](https://hangz-nju-cuhk.github.io/projects/AV-CAT), [StyleGANX ICCV'23](https://www.mmlab-ntu.com/project/styleganex/), [StyleSwap ECCV'22](https://hangz-nju-cuhk.github.io/projects/StyleSwap), [PC-AVS CVPR'21](https://hangz-nju-cuhk.github.io/projects/PC-AVS), [Speech2Talking-Face IJCAI'21](https://www.ijcai.org/proceedings/2021/0141.pdf), [VToonify SIGGRAPH Asia 2022](https://www.youtube.com/watch?v=0_OmVhDgYuY); 38 | - Talking-body synthesis: [MotionDiffuse arXiv'22](https://mingyuan-zhang.github.io/projects/MotionDiffuse.html); 39 | - Face enhancement (relighting, restoration, etc): [Relighting4D ECCV'22](https://www.youtube.com/watch?v=NayAw89qtsY); 40 | 4. [Xiaodong Cun @ Tencent AI Lab](https://vinthony.github.io/academic/): 41 | - Talking-face synthesis: [StyleHEAT ECCV'22](https://arxiv.org/abs/2203.04036), [VideoReTalking SIGGRAPH Asia'22](https://arxiv.org/abs/2211.14758), [ToolTalking ICCV'23](https://arxiv.org/abs/2308.12866), [DPE CVPR'23](https://arxiv.org/abs/2301.06281), [CodeTalker CVPR'23](https://arxiv.org/abs/2301.06281), [SadTalker CVPR'23](https://arxiv.org/abs/2211.12194); 42 | - Talking-body synthesis: [LivelySpeaker ICCV'23](https://arxiv.org/abs/2306.00926); 43 | 45 | 5. Max Planck Institute for Informatics: 46 | - 3D face models (*e.g.,* 3DMM): [FLAME SIGGRAPH Asia 2017](https://flame.is.tue.mpg.de/); 47 | 48 | ## Papers 49 | 50 | ### Image and video generation 51 | |Model|Paper|Blog|Codebase|Note| 52 | |:---:|:---:|:---:|:---:|:---:| 53 | |StyleGANv3|[Alias-Free Generative Adversarial Networks](https://nvlabs.github.io/stylegan3/), NVIDIA, NeurIPS 2021|[The Evolution of StyleGAN: Introduction](https://blog.paperspace.com/evolution-of-stylegan/)|[Code](https://github.com/NVlabs/stylegan3)|high fidlity face generation| 54 | |EG3D|[EG3D: Efficient Geometry-aware 3D Generative Adversarial Networks](https://nvlabs.github.io/eg3d/), NVIDIA, CVPR 2022||[Code](https://github.com/NVlabs/eg3d)|3D-aware GAN| 55 | |Stable Diffusion|[High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/pdf/2112.10752), Heidelberg University, CVPR 2022|[What are Diffusion Models?](https://lilianweng.github.io/posts/2021-07-11-diffusion-models/)|[Code](https://github.com/CompVis/latent-diffusion)|diverse and high quality images| 56 | |Stable Video Diffusion|[Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets](https://arxiv.org/abs/2311.15127), Stability AI, arXiv 2023|[Diffusion Models for Video Generation](https://lilianweng.github.io/posts/2024-04-12-diffusion-video/)|[Code](https://github.com/Stability-AI/generative-models)|| 57 | |DiT|[Scalable Diffusion Models with Transformers](https://arxiv.org/abs/2212.09748), Meta, ICCV 2023|[Diffusion Transformed](https://www.deeplearning.ai/the-batch/a-new-class-of-diffusion-models-based-on-the-transformer-architecture/)|[Code](https://github.com/facebookresearch/DiT)|magic behind OpenAI Sora| 58 | |VQ-VAE|[Neural Discrete Representation Learning](https://arxiv.org/pdf/1711.00937), DeepMind, NIPS 2017|[OpenAI's DALL-E 2 and DALL-E 1 Explained](https://vaclavkosar.com/ml/openai-dall-e-2-and-dall-e-1)||magic behinds OpenAI DALL-E| 59 | |NeRF|[NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis](https://arxiv.org/abs/2003.08934), UC Berkeley, ECCV 2020|[NeRF 
Explosion 2020](https://dellaert.github.io/NeRF/)|[Code](https://github.com/yenchenlin/nerf-pytorch)|3D synthesis via volume rendering| 60 | |3DGS|[3D Gaussian Splatting for Real-Time Radiance Field Rendering](https://arxiv.org/abs/2308.04079), Inria, SIGGRAPH 2023|[A Comprehensive Overview of Gaussian Splatting](https://towardsdatascience.com/a-comprehensive-overview-of-gaussian-splatting-e7d570081362)|[Code](https://github.com/graphdeco-inria/gaussian-splatting)|real-time 3d rendering| 61 | 62 | ### 3D Avatar (face+body) 63 | |Conference|Paper|Affiliation|Codebase|Notes| 64 | |:---:|:---:|:---:|:---:|:---:| 65 | |CVPR 2021|[Function4D: Real-time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors](https://www.liuyebin.com/Function4D/Function4D.html)|Tsinghua University|[Dataset](https://github.com/ytrock/THuman2.0-Dataset)|| 66 | |ECCV 2022|[HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling](https://caizhongang.com/projects/HuMMan/)|Shanghai Artificial Intelligence Laboratory|[Dataset](https://caizhongang.com/projects/HuMMan/)|| 67 | |SIGGRAPH 2023|[AvatarReX: Real-time Expressive Full-body Avatars](https://liuyebin.com/AvatarRex/)|Tsinghua University|[Dataset](https://github.com/lizhe00/AnimatableGaussians/blob/master/AVATARREX_DATASET.md)|| 68 | |arXiv 2024|[A Survey on 3D Human Avatar Modeling - From Reconstruction to Generation](https://arxiv.org/pdf/2406.04253 )|The University of Hong Kong ||| 69 | |arXiv 2024|[From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations](https://people.eecs.berkeley.edu/~evonne_ng/projects/audio2photoreal/static/CCA.pdf)|Meta Reality Labs Research|[Code](https://github.com/facebookresearch/audio2photoreal/) ![Github stars](https://img.shields.io/github/stars/facebookresearch/audio2photoreal.svg) ![Github forks](https://img.shields.io/github/forks/facebookresearch/audio2photoreal.svg)|conversational avatar| 70 | |CVPR 2024|[Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling](https://github.com/lizhe00/AnimatableGaussians?tab=readme-ov-file)|Tsinghua Univserity|[Code](https://github.com/lizhe00/AnimatableGaussians?tab=readme-ov-file) ![Github stars](https://img.shields.io/github/stars/lizhe00/AnimatableGaussians.svg) ![Github forks](https://img.shields.io/github/forks/lizhe00/AnimatableGaussians.svg)|| 71 | |CVPR 2024|[4K4D: Real-Time 4D View Synthesis at 4K Resolution](https://drive.google.com/file/d/1Y-C6ASIB8ofvcZkyZ_Vp-a2TtbiPw1Yx/view?usp=sharing)|Zhejiang University|[Code](https://github.com/zju3dv/4K4D) ![Github stars](https://img.shields.io/github/stars/zju3dv/4K4D.svg) ![Github forks](https://img.shields.io/github/forks/zju3dv/4K4D.svg)|real-time synthesis with 3DGS| 72 | 73 | 74 | ### 2D talking-face synthesis 75 | 76 | |Conference|Paper|Affiliation|Codebase|Training Code|Notes| 77 | |:---:|:---:|:---:|:---:|:---:|:---| 78 | |MM 2020|[Wav2Lip: Accurately Lip-sync Videos to Any Speech](https://arxiv.org/abs/2008.10010)|The International Institute of Islamic Thought (IIIT), India|[Code](https://github.com/Rudrabha/Wav2Lip) ![Github stars](https://img.shields.io/github/stars/Rudrabha/Wav2Lip.svg) ![Github forks](https://img.shields.io/github/forks/Rudrabha/Wav2Lip.svg)|✅|most accurate lip-sync model, bad video quality `96*96`, pre-trained on ~`180` hours video data from [LRS2](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html)| 79 | |MM 2021|[Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face 
Synthesis](https://hcsi.cs.tsinghua.edu.cn/Paper/Paper21/MM21-WUHAOZHE.pdf)|Tsinghua University|[Code](https://github.com/wuhaozhe/style_avatar), ![Github stars](https://img.shields.io/github/stars/wuhaozhe/style_avatar.svg) ![Github forks](https://img.shields.io/github/forks/wuhaozhe/style_avatar.svg)||| 80 | |CVPR 2021|[Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation](https://arxiv.org/abs/2104.11116)|The Chinese University of Hong Kong|[Code](https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS) ![Github stars](https://img.shields.io/github/stars/Hangz-nju-cuhk/Talking-Face_PC-AVS.svg) ![Github forks](https://img.shields.io/github/forks/Hangz-nju-cuhk/Talking-Face_PC-AVS.svg)||contrastive learning on audio-lip| 81 | |ICCV 2021|[PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering](https://arxiv.org/abs/2109.08379)|Peking University|[Code](https://github.com/RenYurui/PIRender) ![Github stars](https://img.shields.io/github/stars/RenYurui/PIRender.svg) ![Github forks](https://img.shields.io/github/forks/RenYurui/PIRender.svg)||| 82 | |ECCV 2022|[StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN](https://arxiv.org/pdf/2203.04036.pdf)|Tsinghua University|[Code](https://github.com/OpenTalker/StyleHEAT) ![Github stars](https://img.shields.io/github/stars/OpenTalker/StyleHEAT.svg) ![Github forks](https://img.shields.io/github/forks/OpenTalker/StyleHEAT.svg)||High-fidenity synthesis via StyleGAN| 83 | |SIGGRAPH Asia 2022|[VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild](https://github.com/OpenTalker/video-retalking)|Xidian University|[Code](https://github.com/OpenTalker/video-retalking) ![Github stars](https://img.shields.io/github/stars/OpenTalker/video-retalking.svg) ![Github forks](https://img.shields.io/github/forks/OpenTalker/video-retalking.svg)||| 84 | |AAAI 2023|[DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video](https://fuxivirtualhuman.github.io/pdf/AAAI2023_FaceDubbing.pdf)|Virtual Human Group, Netease Fuxi AI Lab|[Code](https://github.com/MRzzm/DINet)![Github stars](https://img.shields.io/github/stars/MRzzm/DINet.svg) ![Github forks](https://img.shields.io/github/forks/MRzzm/DINet.svg)|✅|accurate lip-sync and high-quality synthesis (`256*256`)| 85 | |CVPR 2023|[SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation](https://arxiv.org/pdf/2211.12194.pdf)|Xi'an Jiaotong University|[Code](https://github.com/Winfredy/SadTalker) ![Github stars](https://img.shields.io/github/stars/OpenTalker/SadTalker.svg) ![Github forks](https://img.shields.io/github/forks/OpenTalker/SadTalker.svg), [Note](https://github.com/Jason-cs18/awesome-avatar/blob/main/notes/sadtalker.md)||| 86 | |arXiv 2023|[DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models](https://arxiv.org/abs/2312.09767)|Tsinghua University|[Code](https://github.com/ali-vilab/dreamtalk), ![Github stars](https://img.shields.io/github/stars/ali-vilab/dreamtalk.svg) ![Github forks](https://img.shields.io/github/forks/ali-vilab/dreamtalk.svg)||diffusion| 87 | |||Tencent TMElyralab|[MuseTalk: Real-Time High Quality Lip Synchorization with Latent Space Inpainting](https://github.com/TMElyralab/MuseTalk) ![Github stars](https://img.shields.io/github/stars/TMElyralab/MuseTalk.svg) ![Github 
forks](https://img.shields.io/github/forks/TMElyralab/MuseTalk.svg) ||| 88 | |arXiv 2024|[LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control](https://arxiv.org/abs/2407.03168)|Kuaishou Technology|[Code](https://github.com/KwaiVGI/LivePortrait) ![Github stars](https://img.shields.io/github/stars/KwaiVGI/LivePortrait.svg) ![Github forks](https://img.shields.io/github/forks/KwaiVGI/LivePortrait.svg) ||face reenactment with micro-expression| 89 | |arXiv 2024|[EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions](https://arxiv.org/abs/2407.08136)|Ant Group|[Code](https://github.com/BadToBest/EchoMimic) ![Github stars](https://img.shields.io/github/stars/BadToBest/EchoMimic.svg) ![Github forks](https://img.shields.io/github/forks/BadToBest/EchoMimic.svg)||accurate lip-sync on Chinese speakers, diffusion, pre-trained on `540 hours` cleaned video data (collected from internet)| 90 | |arXiv 2024|[Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation](https://arxiv.org/abs/2407.08136)|Fudan University|[Code](https://github.com/fudan-generative-vision/hallo), ![Github stars](https://img.shields.io/github/stars/fudan-generative-vision/hallo.svg) ![Github forks](https://img.shields.io/github/forks/fudan-generative-vision/hallo.svg)|✅|accurate lip-sync, diffusion, pre-trained on `264 hours` of cleaned video data (155 hours from internet and 9 hours from HDTF)| 91 | |[arXiv 2024]|[Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency](https://loopyavatar.github.io/)|Zhejiang University and ByteDance|||expressive animation driven by audio only, pre-trained on `160 hours` of cleaned video data (collected from internet)| 92 | 93 | ### 3D talking-face synthesis 94 | |Conference|Paper|Affiliation|Codebase|Notes| 95 | |:---:|:---:|:---:|:---:|:---:| 96 | |ICCV 2021|[AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis](https://arxiv.org/pdf/2103.11078)|University of Science and Technology of China|[Code](https://github.com/YudongGuo/AD-NeRF)![Github stars](https://img.shields.io/github/stars/YudongGuo/AD-NeRF.svg)![Github forks](https://img.shields.io/github/forks/YudongGuo/AD-NeRF.svg)|| 97 | |ECCV 2022|[Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis](https://github.com/sstzal/DFRF/blob/show_page/images/DFRF_eccv2022.pdf)|Tsinghua University|[Code](https://github.com/sstzal/DFRF)![Github stars](https://img.shields.io/github/stars/sstzal/DFRF.svg)![Github forks](https://img.shields.io/github/forks/sstzal/DFRF.svg)|| 98 | |ICLR 2023|[GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis](https://arxiv.org/pdf/2301.13430)|Zhejiang University|[Code](https://github.com/yerfor/GeneFace)![Github stars](https://img.shields.io/github/stars/yerfor/GeneFace.svg)![Github forks](https://img.shields.io/github/forks/yerfor/GeneFace.svg)|| 99 | |ICCV 2023|[Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis](https://openaccess.thecvf.com/content/ICCV2023/html/Li_Efficient_Region-Aware_Neural_Radiance_Fields_for_High-Fidelity_Talking_Portrait_Synthesis_ICCV_2023_paper.html)|Beihang University|[Code](https://github.com/Fictionarry/ER-NeRF)![Github stars](https://img.shields.io/github/stars/Fictionarry/ER-NeRF.svg)![Github forks](https://img.shields.io/github/forks/Fictionarry/ER-NeRF.svg)|| 100 | |arXiv 2023|[GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face 
Generation](https://arxiv.org/pdf/2305.00787)|Zhejiang University|[Code](https://github.com/yerfor/GeneFacePlusPlus)![Github stars](https://img.shields.io/github/stars/yerfor/GeneFacePlusPlus.svg)![Github forks](https://img.shields.io/github/forks/yerfor/GeneFacePlusPlus.svg)|| 101 | |CVPR 2024|[SyncTalk: The Devil is in the Synchronization for Talking Head Synthesi](https://arxiv.org/pdf/2311.17590)|Renmin University of China|[Code](https://github.com/ziqiaopeng/SyncTalk)![Github stars](https://img.shields.io/github/stars/ziqiaopeng/SyncTalk.svg)![Github forks](https://img.shields.io/github/forks/ziqiaopeng/SyncTalk.svg)|| 102 | |ECCV 2024|[TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting](https://github.com/Fictionarry/TalkingGaussian)|Beihang University|[Code](https://github.com/Fictionarry/TalkingGaussian)![Github stars](https://img.shields.io/github/stars/Fictionarry/TalkingGaussian.svg)![Github forks](https://img.shields.io/github/forks/Fictionarry/TalkingGaussian.svg)|| 103 | 104 | ### Talking-body synthesis 105 | 106 | #### Pose2video 107 | 108 | |Conference|Paper|Affiliation|Codebase|Notes| 109 | |:---:|:---:|:---:|:---:|:---:| 110 | |NeurIPS 2018|[Video-to-Video Synthesis](https://github.com/NVIDIA/vid2vid)|NVIDIA|[Code](https://github.com/NVIDIA/vid2vid) ![Github stars](https://img.shields.io/github/stars/NVIDIA/vid2vid.svg) ![Github forks](https://img.shields.io/github/forks/NVIDIA/vid2vid.svg)|| 111 | |ICCV 2019|[Everybody Dance Now](https://github.com/carolineec/EverybodyDanceNow)|UC Berkeley|[Code](https://github.com/carolineec/EverybodyDanceNow)![Github stars](https://img.shields.io/github/stars/carolineec/EverybodyDanceNow.svg)![Github forks](https://img.shields.io/github/forks/carolineec/EverybodyDanceNow.svg)|| 112 | |arXiv 2023|[Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation](https://arxiv.org/pdf/2311.17117.pdf)|Alibaba Group|[Code](https://github.com/HumanAIGC/AnimateAnyone)![Github stars](https://img.shields.io/github/stars/HumanAIGC/AnimateAnyone.svg)![Github forks](https://img.shields.io/github/forks/HumanAIGC/AnimateAnyone.svg)|| 113 | |CVPR 2024|[MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model](https://github.com/magic-research/magic-animate/blob/main/assets/preprint/MagicAnimate.pdf)|National University of Singapore|[Code](https://github.com/magic-research/magic-animate)![Github stars](https://img.shields.io/github/stars/magic-research/magic-animate.svg)![Github forks](https://img.shields.io/github/forks/magic-research/magic-animate.svg)|| 114 | |arXiv 2024|[Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance](https://arxiv.org/pdf/2403.14781)|Nanjing University|[Code](https://github.com/fudan-generative-vision/champ)![Github stars](https://img.shields.io/github/stars/fudan-generative-vision/champ.svg)![Github forks](https://img.shields.io/github/forks/fudan-generative-vision/champ.svg)|| 115 | |Github repo|[MuseV: Infinite-length and High Fidelity Virtual Human Video Generation with Visual Conditioned Parallel Denoising](https://github.com/TMElyralab/MuseV)|Tencent TMElyralab|[Code](https://github.com/TMElyralab/MuseV)![Github stars](https://img.shields.io/github/stars/TMElyralab/MuseV.svg)![Github forks](https://img.shields.io/github/forks/TMElyralab/MuseV.svg)|| 116 | |Github repo|[MusePose: a Pose-Driven Image-to-Video Framework for Virtual Human 
Generation](https://github.com/TMElyralab/MusePose)|Tencent|[Code](https://github.com/TMElyralab/MusePose)![Github stars](https://img.shields.io/github/stars/TMElyralab/MusePose.svg)![Github forks](https://img.shields.io/github/forks/TMElyralab/MusePose.svg) ⭐|| 117 | |arXiv 2024|[ControlNeXt: Powerful and Efficient Control for Image and Video Generation](https://pbihao.github.io/projects/controlnext/index.html)|The Chinese University of Hong Kong|[Code](https://github.com/dvlab-research/ControlNeXt)![Github stars](https://img.shields.io/github/stars/dvlab-research/ControlNeXt.svg)![Github forks](https://img.shields.io/github/forks/dvlab-research/ControlNeXt.svg)|stable video diffusion| 118 | |[arXiv 2024]|[CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention](https://cyberhost.github.io/)|Zhejiang University and ByteDance||pre-trained on `200 hours` video data and more than `10k` unique identities| 119 | 120 | ## Datasets 121 | 122 | ### Talking-face 123 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 264 |
Audio-Visual Datasets for English Speakers
Dataset nameEnvironmentYearResolutionSubjectDurationSentence
VoxCeleb1Wild2017360p~720p1251352 hours100k
VoxCeleb2Wild2018360p~720p61122442 hours1128k
HDTFWild2020720p~1080p300+15.8 hours
LSPWild2021720p~1080p418 minutes100k
Audio-Visual Datasets for Chinese Speakers
Dataset nameEnvironmentYearResolutionSubjectDurationSentence
CMLRLab201911102k
MAVDLab20231920x10806424 hours12k
CN-CelebWild202030001200 hours
CN-Celeb-AVWild20231136660 hours
CN-CVSWild20232500+300+ hours
265 | 266 | 267 | ## Metrics 268 | 269 | ### Talking-face 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 | 284 | 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | 372 | 373 | 374 | 375 | 376 | 377 | 378 | 379 | 380 |
Lip-Sync
Metric nameDescriptionCode/Paper
LMD↓Mouth landmark distance
MA↑The Intersection-over-Union (IoU) for the overlap between the predicted mouth area and the ground truth area
Sync↑The confidence score from SyncNet (Sync)wav2lip
LSE-C↑Lip Sync Error - Confidencewav2lip
LSE-D↓Lip Sync Error - Distancewav2lip
Image Quality (identity preserving)
Metric nameDescriptionCode/Paper
MAE↓Mean Absolute Error metric for imagemmagic
MSE↓Mean Squared Error metric for imagemmagic
PSNR↑Peak Signal-to-Noise Ratiommagic
SSIM↑Structural similarity for imagemmagic
FID↓Fréchet Inception Distancemmagic
IS↑Inception score mmagic
NIQE↓Natural Image Quality Evaluator metricmmagic
CSIM↑The cosine similarity of identity embeddingInsightFace
CPBD↑The cumulative probability blur detectionpython-cpbd
Diversity
Metric nameDescriptionCode/Paper
Diversity of head motions↑The standard deviation of the head-motion feature embeddings extracted from the generated frames using Hopenet (Ruiz et al., 2018)SadTalker
Beat Align Score↑The alignment of the audio and generated head motions is calculated in Bailando (Siyao et al., 2022)SadTalker
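Several identity-preserving metrics above reduce to a cosine similarity between face embeddings. The sketch below illustrates CSIM with InsightFace; the class and attribute names (`FaceAnalysis`, `normed_embedding`) follow common InsightFace usage but should be verified against the installed version, so treat it as an assumption-laden sketch rather than a reference implementation.

```python
# A minimal CSIM sketch: cosine similarity between the identity embeddings of a
# generated frame and a reference frame. InsightFace API names are assumed from
# common usage and should be checked against the installed version.
import cv2
import numpy as np
from insightface.app import FaceAnalysis

app = FaceAnalysis()
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0: first GPU; use -1 for CPU

def csim(generated_path, reference_path):
    gen_faces = app.get(cv2.imread(generated_path))
    ref_faces = app.get(cv2.imread(reference_path))
    assert gen_faces and ref_faces, "no face detected"
    e1, e2 = gen_faces[0].normed_embedding, ref_faces[0].normed_embedding
    return float(np.dot(e1, e2))  # embeddings are L2-normalized, so dot product == cosine
```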
381 | 382 | ## Toolbox 383 | 1. A general toolbox for AIGC, including common metrics and models https://github.com/open-mmlab/mmagic 384 | 2. face3d: Python tools for processing 3D face https://github.com/yfeng95/face3d 385 | 3. 3DMM model fitting using Pytorch https://github.com/ascust/3DMM-Fitting-Pytorch 386 | 4. OpenFace: a facial behavior analysis toolkit https://github.com/TadasBaltrusaitis/OpenFace 387 | 5. autocrop: Automatically detects and crops faces from batches of pictures https://github.com/leblancfg/autocrop 388 | 6. OpenPose: Real-time multi-person keypoint detection library for body, face, hands, and foot estimation https://github.com/CMU-Perceptual-Computing-Lab/openpose 389 | 7. GFPGAN: Practical Algorithm for Real-world Face Restoration https://github.com/TencentARC/GFPGAN 390 | 8. CodeFormer: Robust Blind Face Restoration https://github.com/sczhou/CodeFormer 391 | 9. metahuman-stream: Real time interactive streaming digital human https://github.com/lipku/metahuman-stream 392 | 10. EasyVolcap: a PyTorch library for accelerating neural volumetric video research https://github.com/zju3dv/EasyVolcap 393 | 11. 3D Model in gradio https://www.gradio.app/guides/how-to-use-3D-model-component 394 | 395 | ### Automatic Speech Recognition (ASR) 396 | 1. BELLE-2/Belle-whisper-large-v3-zh https://huggingface.co/BELLE-2/Belle-whisper-large-v3-zh 397 | 2. SenseVoice (multilingual) https://github.com/FunAudioLLM/SenseVoice 👍👍 398 | 399 | ### Text to Speech (TTS) 400 | 1. CosyVoice, Alibaba Tongyi SpeechTeam https://github.com/FunAudioLLM/CosyVoice 👍👍 401 | 2. FireRedTTS, FireReadTeam https://github.com/FireRedTeam/FireRedTTS 402 | 3. GPT-SoVITS https://github.com/RVC-Boss/GPT-SoVITS?tab=readme-ov-file 403 | 404 | ### Speech to Speech (GPT4-o) 405 | 1. Mini-Omni, Tsinghua University https://github.com/gpt-omni/mini-omni 406 | 2. Speech To Speech, HuggingFace https://github.com/huggingface/speech-to-speech 407 | 408 | ## Related Links 409 | If you are interested in avatar and digital human, we would also like to recommend you to check out other related collections: 410 | - awesome digital human https://github.com/weihaox/awesome-digital-human 411 | --------------------------------------------------------------------------------