├── notes ├── moda.md ├── readme.md ├── example.md ├── StyleHEAT note.md └── sadtalker.md ├── advances ├── README.md ├── tutorials.md └── iccv23_papers.md ├── benchmarks ├── utils │ ├── __init__.py │ ├── __pycache__ │ │ ├── __init__.cpython-39.pyc │ │ └── video_processing.cpython-39.pyc │ └── video_processing.py ├── assets │ └── file_structure.png ├── test.ipynb └── readme.md ├── assets ├── StyleHEAT.png └── sadtalker.png ├── SUMMARY.md ├── memory.md └── README.md /notes/moda.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /advances/README.md: -------------------------------------------------------------------------------- 1 | # advances 2 | 3 | -------------------------------------------------------------------------------- /benchmarks/utils/__init__.py: -------------------------------------------------------------------------------- 1 | from .video_processing import count_sentences -------------------------------------------------------------------------------- /assets/StyleHEAT.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jason-cs18/awesome-avatar/HEAD/assets/StyleHEAT.png -------------------------------------------------------------------------------- /assets/sadtalker.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jason-cs18/awesome-avatar/HEAD/assets/sadtalker.png -------------------------------------------------------------------------------- /benchmarks/assets/file_structure.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jason-cs18/awesome-avatar/HEAD/benchmarks/assets/file_structure.png -------------------------------------------------------------------------------- /benchmarks/utils/__pycache__/__init__.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jason-cs18/awesome-avatar/HEAD/benchmarks/utils/__pycache__/__init__.cpython-39.pyc -------------------------------------------------------------------------------- /benchmarks/utils/__pycache__/video_processing.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jason-cs18/awesome-avatar/HEAD/benchmarks/utils/__pycache__/video_processing.cpython-39.pyc -------------------------------------------------------------------------------- /benchmarks/utils/video_processing.py: -------------------------------------------------------------------------------- 1 | def count_sentences(text_url): 2 | """ 3 | Counts the number of sentences in a given text. 4 | :param text: The text to count the sentences in. 5 | :return: The number of sentences in the text. 
6 | """ 7 | with open(text_url, 'r') as f: 8 | text = f.readlines() 9 | 10 | return len(text[0].split(".")) -------------------------------------------------------------------------------- /SUMMARY.md: -------------------------------------------------------------------------------- 1 | # Table of contents 2 | 3 | * [awesome-avatar](README.md) 4 | * [advances](advances/README.md) 5 | * [ICCV'23, Oct 4-6, 2023](advances/iccv23\_papers.md) 6 | * [Paper note](notes/readme.md) 7 | * [Title](notes/example.md) 8 | * [moda](notes/moda.md) 9 | * [SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking](notes/sadtalker.md) 10 | -------------------------------------------------------------------------------- /notes/readme.md: -------------------------------------------------------------------------------- 1 | # Paper note 2 | All notes follow the style of [DeeplearningAI](https://www.deeplearning.ai/the-batch/tag/research/). 3 | 4 | - [Note Template](https://github.com/Jason-cs18/awesome-avatar/blob/main/notes/example.md) 5 | - [SadTalker CVPR'23](https://github.com/Jason-cs18/awesome-avatar/blob/main/notes/sadtalker.md) 6 | - [MODA ICCV'23](https://github.com/Jason-cs18/awesome-avatar/blob/main/notes/moda.md) 7 | -------------------------------------------------------------------------------- /memory.md: -------------------------------------------------------------------------------- 1 | # Memory-Augmented Talking Face Synthesis 2 | - [arXiv 2020.10] [Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose](https://arxiv.org/pdf/2002.10137.pdf) | Tsinghua University 3 | - [AAAI'22 Oral] [SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory](https://arxiv.org/abs/2211.00924) | KAIST 4 | - [arXiv 2022.12] [Memories are One-to-Many Mapping Alleviators in Talking Face Generation](https://arxiv.org/pdf/2212.05005.pdf) | Shanghai Jiao Tong University 5 | - [ICCV'23] [EMMN: Emotional Motion Memory Network for Audio-driven Emotional Talking Face Generation](https://openaccess.thecvf.com/content/ICCV2023/papers/Tan_EMMN_Emotional_Motion_Memory_Network_for_Audio-driven_Emotional_Talking_Face_ICCV_2023_paper.pdf) | Shanghai Jiao Tong University 6 | - [arXiv 2023.05] [GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation](https://arxiv.org/abs/2305.00787) | Zhejiang University 7 | -------------------------------------------------------------------------------- /benchmarks/test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 7, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "49\n" 13 | ] 14 | } 15 | ], 16 | "source": [ 17 | "import utils\n", 18 | "\n", 19 | "sample_transcript_file = \"/home/jason/Downloads/transcript (3).txt\"\n", 20 | "\n", 21 | "print(utils.count_sentences(sample_transcript_file))" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [] 30 | } 31 | ], 32 | "metadata": { 33 | "kernelspec": { 34 | "display_name": "pytorch", 35 | "language": "python", 36 | "name": "python3" 37 | }, 38 | "language_info": { 39 | "codemirror_mode": { 40 | "name": "ipython", 41 | "version": 3 42 | }, 43 | "file_extension": ".py", 44 | "mimetype": "text/x-python", 45 | "name": "python", 46 | "nbconvert_exporter": "python", 47 
| "pygments_lexer": "ipython3", 48 | "version": "3.9.17" 49 | } 50 | }, 51 | "nbformat": 4, 52 | "nbformat_minor": 2 53 | } 54 | -------------------------------------------------------------------------------- /notes/example.md: -------------------------------------------------------------------------------- 1 | ## Title 2 | Use 1-2 sentences to highlight the key contributions of the paper. 3 | 4 | **what's new:** The authors (xxx and xxx from xxx) proposed xxx to address/achieve xxx. 5 | 6 | **key insights:** Previous works leverages xxx to achieve xxx but they are limited by xxx. To overcome xxx, authors designed xxx. 7 | 8 | **How it works:** The authors designed three-stage pipeline named xxx to generate xxx. In the first, xxx-1. In the second, xxx-2. In the end, xxx-3. 9 | - xxx-1 is trained with xxx. Compared with existing works, xxx is better than xxx. It is because that xxx. 10 | - xxx-2 is trained with xxx. Compared with existing works, xxx is better than xxx. It is because that xxx. 11 | - xxx-3 is trained with xxx. Compared with existing works, xxx is better than xxx. It is because that xxx. 12 | 13 | **Results:** The authors evaluated xxx on xxx. Compared with xxx, xxx is better on xxx. But it is worse than xxx on xxx. It is because that xxx. 14 | 15 | **Why it matters:** This work reveals that xxx but xxx. Such insights deepen our understanding of xxx and can help practitioners explain their outputs. 16 | 17 | **We're thinking:** xxx-1 and xxx-2 may be useful for other pipelines because xxx. 18 | 19 | -------------------------------------------------------------------------------- /notes/StyleHEAT note.md: -------------------------------------------------------------------------------- 1 | # StyleHEAT 2 | 3 | ## StyleHEAT 4 | 5 | Propose a unified framework based on a pre-trained Style-GAN for one-shot talking face generation,which use audio or video driving a source image. 6 | 7 | **what’s new:** 8 | 9 | Fei Yin from Tsinghua University and researchers from Tencent AI Lab have released a new generation pipeline named StyleHEAT to synthesis diverse and stylized talking-face videos 10 | 11 | **key insights:** 12 | 13 | Previous work was unable to generate high-resolution speaking videos due to dataset, efficiency, and other limitations. To overcome the low quality of video resolution,the author verified the spatial characteristics of sytleGAN and added it to the generation framework to achieve high-resolution face speaking video generation, and can edit it by attributes. 14 | 15 | **How it works:** 16 | 17 | The authors designed Video-Driven , Autdio-driven Motion Generator and Feature Calibration to achieve the unified framework. 18 | 19 | - First it obtain the style codes and feature maps of the source image by the encoder of GAN inversion. 20 | - Second the video or audio along with the source image are used to predict motion fields by the corresponding **motion generator**. The motion generator output the desired flow fields for feature warping 21 | - Driving-Video Motion Generator use 3DMM parameters as the motion representation,and the network is based on U-Net 22 | - Driving-Audio Motion Generator use the Mel-Spectrogram,use an MLP to squeeze the temporal dimension. The network is the via AdaIN 23 | - Then,the selected feature map is warped by the motion fields, followed by the **calibration network** for rectifying feature distortions. 24 | - The refined feature map is then fed into the StyleGAN for the final face generation. 
25 | 26 | 27 | 28 | **Results:** 29 | 30 | The authors train the two motion generators on the VoxCeleb dataset, while jointly training the whole framework on the HDTF dataset. The model outputs a 1024×1024 image generated by the pre-trained StyleGAN. The authors compared the generated images with other works, covering both same-identity and cross-identity reenactment. Both qualitatively and quantitatively, this method generates images that are more natural and of higher resolution. In terms of lip sync, it preserves more detail at higher resolution than Wav2Lip. 31 | 32 | **Why it matters:** 33 | 34 | This work proposes a unified framework based on a pre-trained StyleGAN for one-shot, high-quality talking-face generation. Such insights provide a new method and direction for improving generation resolution. 35 | 36 | **We're thinking:** 37 | 38 | We could realize the final video-generation step through GAN inversion and StyleGAN inference, replacing the FaceRender stage in SadTalker. 39 | -------------------------------------------------------------------------------- /advances/tutorials.md: -------------------------------------------------------------------------------- 1 | # Courses, Talks and Tutorials 2 | ## Courses 3 | ### Learning-Based Image Synthesis 4 | - Basic info: CMU 16-726, Spring 2023; 5 | - Instructor: [Jun-Yan Zhu](https://www.cs.cmu.edu/~junyanz/); 6 | - Website: https://learning-image-synthesis.github.io/sp23/ 7 | 8 | ## Tutorials 9 | ### Video Synthesis: Early Days and New Developments 10 | - Conference: ECCV'22 (Oct 24, 2022) 11 | - Organizers: [Sergey Tulyakov @ Snap Research](http://www.stulyakov.com/), [Jian Ren @ Snap Research](https://alanspike.github.io/), [Stéphane Lathuilière @ Telecom Paris](https://stelat.eu/), and [Aliaksandr Siarohin @ Snap Research](https://aliaksandrsiarohin.github.io/aliaksandr-siarohin-website/); 12 | - Website: https://snap-research.github.io/video-synthesis-tutorial/; 13 | 14 | ### Efficient Neural Networks: From Algorithm Design to Practical Mobile Deployments 15 | - Conference: CVPR'23 (June 18, 2023); 16 | - Organizers: [Jian Ren @ Snap Research](https://alanspike.github.io/), [Sergey Tulyakov @ Snap Research](http://www.stulyakov.com/), and [Eric Hu @ Snap](https://www.linkedin.com/in/erichuju/); 17 | - Website: https://snap-research.github.io/efficient-nn-tutorial/; 18 | 19 | ### Full-Stack, GPU-based Acceleration of Deep Learning 20 | - Conference: CVPR'23 (June 18, 2023); 21 | - Organizers: [Maying Shen @ NVIDIA](https://mayings.github.io/), [Jason Clemons @ NVIDIA Research](https://scholar.google.com/citations?user=J_1GGJsAAAAJ&hl=zh-CN), [Hongxu (Danny) Yin @ NVIDIA Research](https://hongxu-yin.github.io/), and [Pavlo Molchanov @ NVIDIA Research](https://www.pmolchanov.com/); 22 | - Website: https://nvlabs.github.io/EfficientDL/; 23 | 24 | ### Denoising Diffusion Models: A Generative Learning Big Bang 25 | - Conference: CVPR'23 (June 18, 2023); 26 | - Organizers: [Jiaming Song @ NVIDIA Research](https://tsong.me/), [Chenlin Meng @ Stanford](https://cs.stanford.edu/~chenlin/), and [Arash Vahdat @ NVIDIA Research](http://latentspace.cc/); 27 | - Website: https://cvpr2023-tutorial-diffusion-models.github.io/; 28 | 29 | ### Prompting in Vision 30 | - Conference: CVPR'23 (June 18, 2023); 31 | - Organizers: [Kaiyang Zhou @ NTU](https://kaiyangzhou.github.io/), [Ziwei Liu @ NTU](https://liuziwei7.github.io/), [Phillip Isola @ MIT](http://web.mit.edu/phillipi/), [Hyojin Bahng @ MIT](), [Ludwig Schmidt @ 
UW](https://people.csail.mit.edu/ludwigs/), [Sarah Pratt @ UW](https://sarahpratt.github.io/), and [Denny Zhou @ Google Research](https://dennyzhou.github.io/) 32 | - Website: https://prompting-in-vision.github.io/; 33 | 34 | ### All Things ViTs: Understanding and Interpreting Attention in Vision 35 | - Conference: CVPR'23 (June 18, 2023); 36 | - Organizers: [Hila Chefer @ Tel Aviv University](https://hila-chefer.github.io/) and [Sayak Paul @ Google Research & Hugging Face](https://sayak.dev/); 37 | - Website: https://all-things-vits.github.io/atv/; 38 | 39 | ### Vision Transformer: More is different 40 | - Conference: CVPR'23 (June 18, 2023); 41 | - Organizers: [Dacheng Tao @ University of Sydney](https://scholar.google.com/citations?user=RwlJNLcAAAAJ&hl=en), [Qiming Zhang @ University of Sydney](https://scholar.google.com/citations?user=f8rAZ7MAAAAJ&hl=zh-CN), [Yufei Xu](https://scholar.google.com/citations?user=hlYWxX8AAAAJ&hl=zh-CN), and [Jing Zhang @ Renmin University of China](https://xiaojingzi.github.io/); 42 | - Website: https://cvpr2023.thecvf.com/virtual/2023/tutorial/18572; 43 | 44 | ## Talks 45 | 1. [Interpreting Deep Generative Models 46 | for Interactive AI Content Creation, Bolei Zhou, CUHK.](https://www.youtube.com/watch?v=PtRU2B6Iml4) -------------------------------------------------------------------------------- /notes/sadtalker.md: -------------------------------------------------------------------------------- 1 | ## SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation 2 | A VAE-based PoseNet is integrated into a new talking-face generation pipeline, SadTalker, to generate diverse head motions in talking-face videos. 3 | 4 | 5 | 6 | **What's new:** Wenxuan Zhang from Xi'an Jiaotong University and researchers from Tencent AI Lab have released a new generation pipeline named SadTalker to synthesize diverse and stylized talking-face videos. 7 | 8 | **Key insights:** Facial landmarks and keypoints were used as the intermediate facial representation in previous works, but they are difficult to disentangle into expressions and head movements. To address this issue, SadTalker leverages an explicit 3D face model to decouple the representations of expression and head motion. Accordingly, the authors designed ExpNet and PoseVAE to learn audio-to-expression and audio-to-pose mappings, respectively. 9 | 10 | 11 | 12 | **How it works:** As shown in Figure 2, the authors designed a three-stage inference pipeline to synthesize stylized talking-face videos. 13 | - They first leveraged a recent single-image deep 3D reconstruction method to extract the 3D face model from the target image, which consists of identity coefficients, expression coefficients, and head pose (rotation and translation). 14 | - Secondly, they estimated the pose and expression coefficients using PoseVAE and ExpNet, respectively, obtaining a motion flow in the 3D facial space. During training, the authors used a reconstruction loss and a distillation loss to encourage ExpNet to learn an accurate mapping for the entire facial motion and for the lip motion, respectively. Similarly, a reconstruction loss and a KL-divergence loss were used to encourage PoseVAE to learn accurate and diverse head motions. 15 | - In the third stage, they leveraged a modified face vid2vid model to render a real face from the estimated 3D facial motion. In training, they first trained face vid2vid in a self-supervised fashion. 
Then they fine-tuned the customized mappingNet in a reconstruction manner. 16 | 17 | 18 | *Note: Training details are introduced in [the supplementary material](https://openaccess.thecvf.com/content/CVPR2023/supplemental/Zhang_SadTalker_Learning_Realistic_CVPR_2023_supplemental.pdf).* 19 | 20 | **Results:** The authors trained SadTalker on VoxCeleb1 with 8 NVIDIA A100 GPUs and tested it on the HDTF dataset. Compared with other competitors, it generated better head motions and more realistic faces, but its lip-sync is worse than that of other methods. 21 | 22 | 23 | 24 | **Why it matters:** This study demonstrated that an explicit 3D face model is a more accurate intermediate representation of facial characteristics than those used in previous methods. Additionally, the use of VAE models for estimating head pose from audio was shown to improve the naturalness of the synthesized videos. These findings have the potential to enhance the quality of talking-face pipelines for machine learning researchers. 25 | 26 | 27 | 28 | **We're thinking:** Instead of using Wav2Lip as a teacher model in ExpNet training, can we use Wav2Lip as ExpNet directly? The 3D-aware face renderer should have similar capabilities to face vid2vid, allowing the renderer to be fine-tuned with 10~30s video clips of a target person to improve lip-sync results. 29 | 30 | *Note: Few-shot learning is widely studied in deep generative models. For details, refer to [fs-vid2vid, NeurIPS'19, NVIDIA](https://nvlabs.github.io/few-shot-vid2vid/) and [face-few-shot, ICCV'19, Samsung AI](https://openaccess.thecvf.com/content_ICCV_2019/papers/Zakharov_Few-Shot_Adversarial_Learning_of_Realistic_Neural_Talking_Head_Models_ICCV_2019_paper.pdf).* -------------------------------------------------------------------------------- /advances/iccv23_papers.md: -------------------------------------------------------------------------------- 1 | # ICCV'23, Oct 4-6, 2023 2 | On this page, we provide a short review of recent advances (21 papers) on digital humans, including 3D face reconstruction, talking-face synthesis, talking-body synthesis, and talking-face video editing. 3 | 4 | ## Workshops and Tutorials 5 | 1. To NeRF or not to NeRF: 6 | A View Synthesis Challenge for Human Heads https://sites.google.com/view/vschh/home 7 | 2. The 11th IEEE International Workshop on Analysis and Modeling of Faces and Gestures https://web.northeastern.edu/smilelab/amfg2023/ 8 | ## 3D Face Reconstruction (3) 9 | 10 | 11 | 12 | 13 | 14 | - [Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation](https://openaccess.thecvf.com/content/ICCV2023/papers/He_Speech4Mesh_Speech-Assisted_Monocular_3D_Facial_Reconstruction_for_Speech-Driven_3D_Facial_ICCV_2023_paper.pdf), University of Science and Technology of China, [FLAME-Universe (3DMM alternative)](https://github.com/TimoBolkart/FLAME-Universe). 15 | - [HiFace: High-Fidelity 3D Face Reconstruction by 16 | Learning Static and Dynamic Details](https://openaccess.thecvf.com/content/ICCV2023/papers/Chai_HiFace_High-Fidelity_3D_Face_Reconstruction_by_Learning_Static_and_Dynamic_ICCV_2023_paper.pdf), National University of Singapore, [ProjectPage](https://project-hiface.github.io/). 17 | - [ASM: Adaptive Skinning Model for High-Quality 3D Face Modeling](https://openaccess.thecvf.com/content/ICCV2023/papers/Yang_ASM_Adaptive_Skinning_Model_for_High-Quality_3D_Face_Modeling_ICCV_2023_paper.pdf), Tencent AI Lab. 
18 | ## Talking-Face Synthesis 19 | 20 | ### Metrics and Benchchmarks (1) 21 | - [On the Audio-visual Synchronization for Lip-to-Speech Synthesis](https://openaccess.thecvf.com/content/ICCV2023/papers/Niu_On_the_Audio-visual_Synchronization_for_Lip-to-Speech_Synthesis_ICCV_2023_paper.pdf), The Hong Kong University of Science and Technology. 22 | 23 | ### 2D Talking-Face Synthesis (11) 24 | *keywords: emotion, diffusion priors, memory network, StyleGAN2*; 25 | 26 |  27 | 28 | - [Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation](https://openaccess.thecvf.com/content/ICCV2023/html/Song_Emotional_Listener_Portrait_Neural_Listener_Head_Generation_with_Emotion_ICCV_2023_paper.html), University of Rochester. 29 | - [Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors](https://openaccess.thecvf.com/content/ICCV2023/html/Yu_Talking_Head_Generation_with_Probabilistic_Audio-to-Visual_Diffusion_Priors_ICCV_2023_paper.html), Xiaobing.AI, [ProjectPage](https://zxyin.github.io/TH-PAD/). 30 | - [Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation](https://openaccess.thecvf.com/content/ICCV2023/papers/Gan_Efficient_Emotional_Adaptation_for_Audio-Driven_Talking-Head_Generation_ICCV_2023_paper.pdf), Zhejiang University, [ProjectPage](https://yuangan.github.io/eat/), [Code](https://github.com/yuangan/EAT_code), [Blog](https://zhuanlan.zhihu.com/p/658569026). 31 | - [Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head Video Generation](https://openaccess.thecvf.com/content/ICCV2023/papers/Hong_Implicit_Identity_Representation_Conditioned_Memory_Compensation_Network_for_Talking_Head_ICCV_2023_paper.pdf), The Hong Kong University of Science and Technology, [ProjectPage](https://harlanhong.github.io/publications/mcnet.html), [Code](https://github.com/harlanhong/ICCV2023-MCNET). 32 | - [ToonTalker: Cross-Domain Face Reenactment](https://openaccess.thecvf.com/content/ICCV2023/papers/Gong_ToonTalker_Cross-Domain_Face_Reenactment_ICCV_2023_paper.pdf), Tsinghua University, [ProjectPage](https://opentalker.github.io/ToonTalker/), [Code](https://github.com/yuanygong/ToonTalker). 33 | - [Robust One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2](https://openaccess.thecvf.com/content/ICCV2023/papers/Oorloff_Robust_One-Shot_Face_Video_Re-enactment_using_Hybrid_Latent_Spaces_of_ICCV_2023_paper.pdf), University of Maryland, [ProjectPage](https://trevineoorloff.github.io/FaceVideoReenactment_HybridLatents.io/). 34 | - [EMMN: Emotional Motion Memory Network for Audio-driven Emotional Talking Face Generation](https://openaccess.thecvf.com/content/ICCV2023/papers/Tan_EMMN_Emotional_Motion_Memory_Network_for_Audio-driven_Emotional_Talking_Face_ICCV_2023_paper.pdf), Shanghai Jiao Tong University. 35 | - [EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation](https://openaccess.thecvf.com/content/ICCV2023/papers/Peng_EmoTalk_Speech-Driven_Emotional_Disentanglement_for_3D_Face_Animation_ICCV_2023_paper.pdf), Renmin University of China, [ProjectPage](https://ziqiaopeng.github.io/emotalk/), [Code](https://github.com/psyai-net/EmoTalk_release). 36 | - [HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and 37 | Retarget Faces](https://openaccess.thecvf.com/content/ICCV2023/papers/Bounareli_HyperReenact_One-Shot_Reenactment_via_Jointly_Learning_to_Refine_and_Retarget_ICCV_2023_paper.pdf), Kingston University London, [Code](https://github.com/StelaBou/HyperReenact). 
38 | - [MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions](https://openaccess.thecvf.com/content/ICCV2023/papers/Liu_MODA_Mapping-Once_Audio-driven_Portrait_Animation_with_Dual_Attentions_ICCV_2023_paper.pdf), International Digital Economy Academy (IDEA), [ProjectPage](https://liuyunfei.net/projects/iccv23-moda/), [Code](https://github.com/DreamtaleCore/MODA). 39 | - [SPACE: Speech-driven Portrait Animation with Controllable Expression](https://openaccess.thecvf.com/content/ICCV2023/papers/Gururani_SPACE_Speech-driven_Portrait_Animation_with_Controllable_Expression_ICCV_2023_paper.pdf), NVIDIA, [ProjectPage](https://research.nvidia.com/labs/dir/space/). 40 | ### 3D Talking-Face Synthesis (1) 41 | *keywords: efficiency*; 42 | 43 | - [Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis](https://openaccess.thecvf.com/content/ICCV2023/html/Li_Efficient_Region-Aware_Neural_Radiance_Fields_for_High-Fidelity_Talking_Portrait_Synthesis_ICCV_2023_paper.html), Beihang University, [Code](https://github.com/Fictionarry/ER-NeRF). 44 | 45 | ## Talking-Face Video Editing (1) 46 | 47 | *keywords: GAN inversion, StyleGAN*; 48 | 49 | - [RIGID: Recurrent GAN Inversion and Editing of Real Face Videos](https://openaccess.thecvf.com/content/ICCV2023/papers/Xu_RIGID_Recurrent_GAN_Inversion_and_Editing_of_Real_Face_Videos_ICCV_2023_paper.pdf), The University of Hong Kong, [ProjectPage](https://cnnlstm.github.io/RIGID/), [Code](https://github.com/cnnlstm/RIGID). 50 | 51 | ## Talking-Body Synthesis (4) 52 | 53 | *keywords: continual learning, lively, one-shot*; 54 | 55 | - [Continual Learning for Personalized Co-Speech Gesture Generation](https://openaccess.thecvf.com/content/ICCV2023/html/Ahuja_Continual_Learning_for_Personalized_Co-speech_Gesture_Generation_ICCV_2023_paper.html), CMU, [ProjectPage](https://chahuja.com/cdiffgan/), [Dataset](https://chahuja.com/pats/). 56 | - [LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation](https://openaccess.thecvf.com/content/ICCV2023/html/Zhi_LivelySpeaker_Towards_Semantic-Aware_Co-Speech_Gesture_Generation_ICCV_2023_paper.html), ShanghaiTech University, [Code](https://github.com/zyhbili/LivelySpeaker). 57 | - [DINAR: Diffusion Inpainting of Neural Textures for One-Shot Human Avatars](https://openaccess.thecvf.com/content/ICCV2023/html/Svitov_DINAR_Diffusion_Inpainting_of_Neural_Textures_for_One-Shot_Human_Avatars_ICCV_2023_paper.html), Samsung AI Center, [Code](https://github.com/SamsungLabs/DINAR). 58 | - [One-shot Implicit Animatable Avatars with Model-based Priors](https://openaccess.thecvf.com/content/ICCV2023/html/Huang_One-shot_Implicit_Animatable_Avatars_with_Model-based_Priors_ICCV_2023_paper.html), Zhejiang University, [Code](https://github.com/huangyangyi/ELICIT). 59 | -------------------------------------------------------------------------------- /benchmarks/readme.md: -------------------------------------------------------------------------------- 1 | # A Microbenchmark for Talking-Face Synthesis 2 | ### [**Dataset**](https://drive.google.com/drive/folders/1vBse3rgHd3JfTGNFXC-oUZs5DR9B5Mep?usp=sharing) | [**Website**](https://jason-cs18.github.io/awesome-avatar/benchmarks/) 3 | 4 | This repository contains the datasets and testing scripts for talking-face synthesis. 5 | 6 | > A microbenchmark serves as a valuable tool for researchers to conduct speedy evaluations of new algorithms. This repository can be easily customized and applied to diverse audio-visual talking-face datasets. 
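For reference, the benchmark's utility package can be called directly from a notebook or script (see `test.ipynb`). The sketch below assumes it is run from the `benchmarks/` directory; `transcript.txt` is an illustrative path, not a file shipped with this repository.

```python
# Minimal usage sketch, assuming the working directory is benchmarks/.
# "transcript.txt" is a placeholder path; point it at your own transcript file.
import utils  # re-exports count_sentences from utils/video_processing.py

transcript_path = "transcript.txt"
num_sentences = utils.count_sentences(transcript_path)  # splits the first line on "."
print(num_sentences)
```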
7 | 8 | ### Datasets 9 | In this benchmark, we collect 3 videos for English speakers and 3 videos for Chinese speakers. 10 | 11 | 14 | 15 | 16 | 17 | #### File Structure 18 | ``` 19 | ├── driving_audios 20 | | ├── [9.3M] may_english_audio.aac 21 | | ├── [3.3M] macron_english_trim_audio.aac 22 | | ├── [3.5M] obama1_english_audio.aac 23 | | ├── [780K] laoliang_chinese_50s_audio.mp3 24 | | ├── [4.3M] luoxiang_chinese_audio.mp3 25 | | ├── [8.9M] zuijiapaidang_chinese_audio.mp3 26 | ├── source_images 27 | | ├── [294K] may.png 28 | | ├── [202K] macron.png 29 | | ├── [213M] obama1.png 30 | | ├── [206K] zuijiapaidang.png 31 | | ├── [175K] luoxiang.png 32 | | ├── [204K] laoliang.png 33 | ├── reference_videos 34 | │ ├── [56M] obama1_english.mp4, 03:38.16, 25fps, 450x450, 46 sentences 35 | │ ├── [96M] may_english.mp4, 04:02.97, 25fps, 512x512, 35 sentences 36 | │ ├── [24M] macron_english_trim.mp4, 00:03:31.92, 25fps, 512x512, 49 sentences 37 | │ ├── [3.6M] laoliang_chinese_50s.mp4, 00:00:49.85, 30fps, 410x380, 40 sentences 38 | │ ├── [14M] luoxiang_chinese.mp4, 04:40.01, 25fps, 350x500, 32 sentences 39 | │ ├── [28M] zuijiapaidang_chinese.mp4, 09:41.98, 30fps, 460x450, 85 sentences 40 | ``` 41 | 42 |
| 53 | | 54 | | 55 | |
| 66 | | 67 | | 68 | |
PSNR: 32.287, SSIM: 0.951, FID: 18.993 |
92 | PSNR: 32.572, SSIM: 0.936, FID: 33.941 |
93 | PSNR: 35.737, SSIM: 0.969, FID: 6.121 |
94 |
| 99 | | 100 | | 101 | |
PSNR: 31.444, SSIM: 0.939, FID: 19.192 |
107 | PSNR: 34.367, SSIM: 0.971, FID: 23.631 |
108 | PSNR: 20.364, SSIM: 0.783, FID: 49.04 |
109 |
| 112 | | 113 | | 114 | |
PSNR: 20.587, SSIM: 0.754, FID: 24.051 |
123 | PSNR: 19.211, SSIM: 0.701, FID: 46.182 |
124 | PSNR: 18.729, SSIM: 0.763, FID: 98.982 |
125 |
| 130 | | 131 | | 132 | |
PSNR: 18.536, SSIM: 0.672, FID: 52.362 |
138 | PSNR: 14.363, SSIM: 0.598, FID: 104.221 |
139 | PSNR: 17.359, SSIM: 0.725, FID: 4.781 |
140 |
| 143 | | 144 | | 145 | |
| Pipeline | 156 |Sync↑ | 157 |PSNR↑ | 158 |SSIM↑ | 159 |FID↓ | 160 |Pipeline | 161 |Sync↑ | 162 |PSNR↑ | 163 |SSIM↓ | 164 |FID↓ | 165 |
| Wav2Lip | 168 |xxx | 169 |33.532 | 170 |0.952 | 171 |19.685 | 172 | 173 |Wav2Lip | 174 |xxx | 175 |28.725 | 176 |0.897 | 177 |30.621 | 178 | 179 |
| SadTalker | 182 |xxx | 183 |19.509 | 184 |0.739 | 185 |56.407 | 186 | 187 |SadTaler | 188 |xxx | 189 |16.753 | 190 |0.665 | 191 |68.120 | 192 | 193 |
| marcon_GeneFace.mp4 | 208 |macron_ER-NeRF.mp4 | 209 |
| zuijiapaidang_GeneFace.mp4 | 214 |zuijiapaidang_ER-NeRF.mp4 | 215 |
| Pipeline | 228 |Sync↑ | 229 |PSNR↑ | 230 |SSIM↓ | 231 |FID↓ | 232 |IS↑ | 233 |Pipeline | 234 |Sync↑ | 235 |PSNR↑ | 236 |SSIM↓ | 237 |FID↓ | 238 |IS↑ | 239 |
| GeneFace | 242 |xxx | 243 |xxx | 244 |xxx | 245 |xxx | 246 |xxx | 247 |GeneFace | 248 |xxx | 249 |xxx | 250 |xxx | 251 |xxx | 252 |xxx | 253 |
| ER-NeRF | 256 |xxx | 257 |xxx | 258 |xxx | 259 |xxx | 260 |xxx | 261 |ER-NeRF | 262 |xxx | 263 |xxx | 264 |xxx | 265 |xxx | 266 |xxx | 267 |
| Dataset name | 134 |Environment | 135 |Year | 136 |Resolution | 137 |Subject | 138 |Duration | 139 |Sentence | 140 |
| VoxCeleb1 | 143 |Wild | 144 |2017 | 145 |360p~720p | 146 |1251 | 147 |352 hours | 148 |100k | 149 |
| VoxCeleb2 | 152 |Wild | 153 |2018 | 154 |360p~720p | 155 |6112 | 156 |2442 hours | 157 |1128k | 158 |
| HDTF | 161 |Wild | 162 |2020 | 163 |720p~1080p | 164 |300+ | 165 |15.8 hours | 166 |167 | |
| LSP | 170 |Wild | 171 |2021 | 172 |720p~1080p | 173 |4 | 174 |18 minutes | 175 |100k | 176 |
| Dataset name | 182 |Environment | 183 |Year | 184 |Resolution | 185 |Subject | 186 |Duration | 187 |Sentence | 188 |
| CMLR | 191 |Lab | 192 |2019 | 193 |194 | | 11 | 195 |196 | | 102k | 197 |
| MAVD | 200 |Lab | 201 |2023 | 202 |1920x1080 | 203 |64 | 204 |24 hours | 205 |12k | 206 |
| CN-Celeb | 209 |Wild | 210 |2020 | 211 |212 | | 3000 | 213 |1200 hours | 214 |215 | |
| CN-Celeb-AV | 218 |Wild | 219 |2023 | 220 |221 | | 1136 | 222 |660 hours | 223 |224 | |
| CN-CVS | 227 |Wild | 228 |2023 | 229 |230 | | 2500+ | 231 |300+ hours | 232 |233 | |
| Metric name | 276 |Description | 277 |Code/Paper | 278 |
| LMD↓ | 281 |Mouth landmark distance | 282 |283 | |
| LMD↓ | 286 |Mouth landmark distance | 287 |288 | |
| MA↑ | 291 |The Insertion-over-Union (IoU) for the overlap between the predicted mouth area and the ground truth area | 292 |293 | |
| Sync↑ | 296 |The confidence score from SyncNet (Sync) | 297 |wav2lip | 298 |
| LSE-C↑ | 301 |Lip Sync Error - Confidence | 302 |wav2lip | 303 |
| LSE-D↓ | 306 |Lip Sync Error - Distance | 307 |wav2lip | 308 |
| Metric name | 314 |Description | 315 |Code/Paper | 316 |
| MAE↓ | 319 |Mean Absolute Error metric for image | 320 |mmagic | 321 |
| MSE↓ | 324 |Mean Squared Error metric for image | 325 |mmagic | 326 |
| PSNR↑ | 329 |Peak Signal-to-Noise Ratio | 330 |mmagic | 331 |
| SSIM↑ | 334 |Structural similarity for image | 335 |mmagic | 336 |
| FID↓ | 339 |Frchet Inception Distance | 340 |mmagic | 341 |
| IS↑ | 344 |Inception score | 345 |mmagic | 346 |
| NIQE↓ | 349 |Natural Image Quality Evaluator metric | 350 |mmagic | 351 |
| CSIM↑ | 354 |The cosine similarity of identity embedding | 355 |InsightFace | 356 |
| CPBD↑ | 359 |The cumulative probability blur detection | 360 |python-cpbd | 361 |
| Metric name | 367 |Description | 368 |Code/Paper | 369 |
| Diversity of head motions↑ | 372 |A standard deviation of the head motion feature embeddings extracted from the generated frames using Hopenet (Ruiz et al., 2018) is calculated | 373 |SadTalker | 374 |
| Beat Align Score↑ | 377 |The alignment of the audio and generated head motions is calculated in Bailando (Siyao et al., 2022) | 378 |SadTalker | 379 |