1 | # Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions 2 | 3 | [![arXiv](https://img.shields.io/badge/arXiv-Paper-.svg)](https://www.arxiv.org/abs/2504.19056) [![Website](https://img.shields.io/website?url=https%3A%2F%2Fmultimodalrag.github.io%2F)](https://generative-ai-for-character-animation.github.io/) 4 | 5 | 6 | This repository collects and categorizes papers, datasets, and resources related to generative AI for character animation, following the structure of our survey. As generative AI continues to transform animation, from realistic facial synthesis to dynamic gesture and motion generation, and given the rapid growth of the field, we will continuously update both the paper and this repository to serve as a comprehensive guide for researchers and practitioners. 7 | 8 | --- 9 | 10 | ## πŸ“’ News 11 | - **April 27, 2025**: We release the first version of our survey. 12 | 13 | 14 | *Feel free to cite, contribute, or open a pull request to add recent related papers!* 15 | 16 | ## πŸ“‘ List of Contents 17 | 18 | - [πŸ“ Abstract](#-abstract) 19 | - [πŸ—Ί Overview](#-overview) 20 | - [🌳 Taxonomy](#-taxonomy) 21 | - [πŸ“š Background](#-background) 22 | - [πŸ€– Models](#-models) 23 | - [🎨 Computer Graphics Models](#-computer-graphics-models) 24 | - [πŸ‘€ Vision](#-vision) 25 | - [πŸ“ Language Models](#-language-models) 26 | - [πŸ•’ Temporal Sequence Modeling](#-temporal-sequence-modeling) 27 | - [πŸ—£ Speech Models](#-speech-models) 28 | - [🎭 Additional Generative Models](#-additional-generative-models) 29 | - [πŸ“Š Metrics](#-metrics) 30 | - [βœ… Quality and Realism of Generated Output](#-quality-and-realism-of-generated-output) 31 | - [πŸ”„ Diversity and Multimodality](#-diversity-and-multimodality) 32 | - [🎯 Relevance and Accuracy](#-relevance-and-accuracy) 33 | - [πŸƒ Physical Plausibility and Interaction](#-physical-plausibility-and-interaction) 34 | - [⚑️ Efficiency and Computational Metrics](#-efficiency-and-computational-metrics) 35 | - [πŸ‘¨ Face](#-face) 36 | - [πŸ—‚ Datasets](#-datasets) 37 | - [πŸ€– Models](#-models-1) 38 | - [πŸ˜ƒ Expression](#-expression) 39 | - [πŸ—‚ Datasets](#-datasets-1) 40 | - [πŸ€– Models](#-models-2) 41 | - [πŸŽ™ Speech-Driven & Multimodal Expression Generation](#-speech-driven--multimodal-expression-generation) 42 | - [πŸ” Expression Retargeting & Motion Transfer](#-expression-retargeting--motion-transfer) 43 | - [πŸ–Ό Image](#-image) 44 | - [πŸ—‚ Datasets](#-datasets-2) 45 | - [πŸ€– Models](#-models-3) 46 | - [πŸ”§ Fine-Tuning & Regularization](#-fine-tuning--regularization) 47 | - [βœ‚ Image Editing & Disentanglement](#-image-editing--disentanglement) 48 | - [πŸ‘½ Multimodal Conversations & Visual Understanding](#-multimodal-conversations--visual-understanding) 49 | - [πŸ‘€ Avatar](#-avatar) 50 | - [πŸ—‚ Datasets](#-datasets-3) 51 | - [πŸ€– Models](#-models-4) 52 | - [πŸ” CLIP-Guided Models](#-clip-guided-models) 53 | - [🧩 Implicit Function-Based Models](#-implicit-function-based-models) 54 | - [πŸŽ₯ NeRF-Based Methods](#-nerf-based-methods) 55 | - [🌈 Diffusion-Based Methods](#-diffusion-based-methods) 56 | - [πŸ”€ Hybrid Methods](#-hybrid-methods) 57 | - [🀝 Gesture](#-gesture) 58 | - [πŸ—‚ Datasets](#-datasets-4) 59 | - [πŸ€– Models](#-models-5) 60 | - [πŸ›  Traditional & 
Parametric Approaches](#-traditional--parametric-approaches) 61 | - [🧠 Deep Learning-Based Models](#-deep-learning-based-models) 62 | - [πŸš€ Transformer-Based Models](#-transformer-based-models) 63 | - [πŸŽ₯ Motion](#-motion) 64 | - [πŸ—‚ Datasets](#-datasets-5) 65 | - [πŸ€– Models](#-models-6) 66 | - [πŸ”€ Language-to-Pose Models](#-language-to-pose-models) 67 | - [πŸ“¦ Variational Auto-Encoder (VAE) Based Models](#-variational-auto-encoder-vae-based-models) 68 | - [πŸ— VQ-VAE Based Models](#-vq-vae-based-models) 69 | - [🌈 Diffusion-Based Models](#-diffusion-based-models) 70 | - [πŸ“¦ Object](#-object) 71 | - [πŸ—‚ Datasets](#-datasets-6) 72 | - [πŸ€– Models](#-models-7) 73 | - [🧡 Texture](#-texture) 74 | - [πŸ—‚ Datasets](#-datasets-7) 75 | - [πŸ€– Models](#-models-8) 76 | 79 | 80 | --- 81 | 82 | ## πŸ“ Abstract 83 | 84 | Generative AI is transforming various fields, including art, gaming, and animation. One of its most significant applications lies in animation, where advances in artificial intelligenceβ€”such as foundation models and diffusion modelsβ€”have driven remarkable progress, significantly reducing the time and cost of content creation. Characters are central components of animations involving elements such as motion, emotions, gestures, and facial expressions. Rapid and wide-ranging developments in AI-driven animation technologies have made it challenging to maintain an overarching view of progress in the field, highlighting the need for a comprehensive survey to integrate and contextualize these advancements. 85 | 86 | This survey offers a comprehensive review of the state-of-the-art generative AI applications for animated character design and behavior, integrating a wide range of aspects often examined in isolation (e.g., avatars, gestures, and facial expressions). Unlike previous studies, it provides a unified perspective covering all major applications of generative AI in character animation. The survey begins with foundational concepts and introduces evaluation metrics tailored to this domain, then explores key areas such as facial animation, image synthesis, avatar generation, gesture modeling, motion synthesis, expression rendering, and texture generation. Finally, it addresses the main challenges and outlines future research directions, offering a roadmap to advance AI-driven character animation technologies. This survey aims to serve as a resource for researchers and developers in generative AI for animation and related fields. 
87 | 88 | --- 89 | ## πŸ—Ί Overview 90 | [![overview.png](https://i.postimg.cc/GtNMTS6m/overview.png)](https://postimg.cc/rR1Gvg8B) 91 | --- 92 | ## 🌳 Taxonomy 93 | 94 | ![Taxonomy_page-0001](https://github.com/user-attachments/assets/a4396171-01b9-47ed-a717-b0381bd26f4e) 95 | --- 96 | 97 | ## πŸ“š Background 98 | 99 | ### πŸ€– Models 100 | 101 | #### 🎨 Computer Graphics Models 102 | - **SMPL** [πŸ”—](https://smpl.is.tue.mpg.de/) 103 | *A popular parametric model representing 3D human body geometry using a low-dimensional representation for shape (Ξ²) and pose (ΞΈ).* 104 | - **SMPL+H** 105 | *An extension of SMPL that incorporates detailed hand modeling by introducing hand joint parameters (ΞΈhands).* 106 | - **SMPL-X** 107 | *Further extends SMPL+H by including facial expressions along with detailed hand and body modeling for full-body human representation.* 108 | - **SMIL (Skinned Multi-Infant Linear Model)** [πŸ”—](https://files.is.tue.mpg.de/black/papers/miccai18.pdf) 109 | *A model developed specifically for infants, addressing challenges in capturing non-cooperative subjects with low-quality RGB-D data.* 110 | - **SMAL (Skinned Multi-Animal Linear Model)** [πŸ”—](https://smal.is.tue.mpg.de/) 111 | *Designed for 3D modeling of animals, enabling the creation of a shape space from a few scans of diverse species.* 112 | 113 | #### πŸ‘€ Vision 114 | - **Convolutional Neural Networks (CNNs)** [πŸ”—](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) 115 | *CNNs are specialized for image-related tasks by using convolution, pooling, and fully connected layers.* 116 | - **3D CNNs** [πŸ”—](https://www.cv-foundation.org/openaccess/content_iccv_2015/html/Tran_Learning_Spatiotemporal_Features_ICCV_2015_paper.html) 117 | *Extend CNNs to process volumetric data (e.g., videos, MRI scans) by using 3D convolutional kernels.* 118 | - **U-Net** [πŸ”—](https://arxiv.org/abs/1505.04597) 119 | *A U-shaped network architecture designed for biomedical image segmentation, known for its efficient denoising and skip connections.* 120 | - **Inception** [πŸ”—](https://arxiv.org/abs/1409.4842) 121 | *Introduces multi-scale processing via parallel convolutions (1x1, 3x3, 5x5) for improved feature extraction.* 122 | - **VGG** [πŸ”—](https://arxiv.org/abs/1409.1556) 123 | *Evaluate the impact of increasing CNN depth using very small (3x3) filters to capture complex visual features.* 124 | - **ResNet** [πŸ”—](https://arxiv.org/abs/1512.03385) 125 | *Introduces residual learning with shortcut connections to enable training of very deep networks (up to 152 layers).* 126 | - **Vision Transformers (ViTs)** [πŸ”—](https://arxiv.org/abs/2010.11929) 127 | *Applies the self-attention mechanism to image patches, offering competitive performance on image recognition tasks.* 128 | 129 | #### πŸ“ Language Models 130 | - **RNNs** [πŸ”—](https://en.wikipedia.org/wiki/Recurrent_neural_network) 131 | *General recurrent neural networks for sequence modeling.* 132 | - **Bidirectional RNNs (BRNNs)** [πŸ”—](https://ieeexplore.ieee.org/document/650093) 133 | *Process sequences in both directions to leverage past and future context.* 134 | - **Encoder-Decoder Frameworks** [πŸ”—](https://arxiv.org/abs/1409.3215) 135 | *Used for tasks like machine translation by compressing sequences into a fixed-length vector.* 136 | - **LSTMs** [πŸ”—](https://www.bioinf.jku.at/publications/older/2604.pdf) 137 | *Introduces memory cells and gating mechanisms to capture long-term dependencies.* 138 | - 
**GRUs** [πŸ”—](https://arxiv.org/abs/1406.1078) 139 | *A streamlined variant of LSTMs merging input and forget gates into an update gate.* 140 | - **Attention Mechanisms** [πŸ”—](https://arxiv.org/abs/1706.03762) 141 | *Allows models to dynamically focus on different parts of the input sequence.* 142 | - **Transformers** [πŸ”—](https://arxiv.org/abs/1706.03762) 143 | *Utilize self-attention to process sequences without recurrence.* 144 | - **BERT** [πŸ”—](https://arxiv.org/abs/1810.04805) 145 | *Bidirectional Encoder Representations from Transformers for deep language understanding.* 146 | - **GPT Series:** 147 | - **GPT-1** [πŸ”—](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) 148 | - **GPT-2** [πŸ”—](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) 149 | - **GPT-3** [πŸ”—](https://arxiv.org/abs/2005.14165) 150 | - **GPT-3.5 / ChatGPT** 151 | - **InstructGPT** [πŸ”—](https://arxiv.org/abs/2203.02155) 152 | - **GPT-4** [πŸ”—](https://arxiv.org/abs/2303.08774) 153 | - **GPT-4-O** [πŸ”—](https://arxiv.org/abs/2410.21276) 154 | - **PoseGPT** [πŸ”—](https://arxiv.org/abs/2210.10542) 155 | *Specialized for pose estimation in video generation.* 156 | - **GestureGPT** [πŸ”—](https://arxiv.org/abs/2310.12821) 157 | *Extends the GPT framework to generate realistic human gestures based on text or audio input.* 158 | - **MotionGPT** [πŸ”—](https://arxiv.org/abs/2306.14795) 159 | *Designed for generating motion sequences.* 160 | 161 | #### πŸ•’ Temporal Sequence Modeling 162 | - **Temporal Convolutional Networks (TCNs)** [πŸ”—](https://arxiv.org/abs/1803.01271) 163 | *Use causal and dilated convolutions to model sequential data efficiently.* 164 | - **Transformer-XL** [πŸ”—](https://arxiv.org/abs/1901.02860) 165 | *Extends Transformers with a segment-level recurrence mechanism to capture long-range dependencies.* 166 | - **ConvLSTM** [πŸ”—](https://arxiv.org/abs/1506.04214) 167 | *Combines CNNs and LSTM units to capture both spatial and temporal dynamics in spatiotemporal data.* 168 | 169 | #### πŸ—£ Speech Models 170 | - **WaveNet** [πŸ”—](https://arxiv.org/abs/1609.03499) 171 | *An autoregressive model for raw audio synthesis using dilated causal convolutions.* 172 | - **Tacotron** [πŸ”—](https://arxiv.org/abs/1703.10135) 173 | *A sequence-to-sequence TTS model that converts text to mel-spectrograms via attention.* 174 | - **Tacotron 2** [πŸ”—](https://arxiv.org/abs/1712.05884) 175 | *Combines Tacotron with a WaveNet vocoder for end-to-end, high-fidelity speech synthesis.* 176 | - **FastSpeech** [πŸ”—](https://arxiv.org/abs/1905.09263) 177 | *A non-autoregressive TTS model using transformers for parallel synthesis to reduce latency.* 178 | - **FastSpeech 2** [πŸ”—](https://arxiv.org/abs/2006.04558) 179 | *Improves FastSpeech by introducing variance predictors for pitch, energy, and duration for more natural speech.* 180 | - **Wave2Vec** [πŸ”—](https://arxiv.org/abs/1904.05862) 181 | *A self-supervised framework for learning robust speech representations directly from raw audio.* 182 | - **Wave2Vec 2.0** [πŸ”—](https://arxiv.org/abs/2006.11477) 183 | *Enhances Wave2Vec with quantization and contextual embeddings to improve ASR performance.* 184 | - **HuBERT** [πŸ”—](https://arxiv.org/abs/2106.07447) 185 | *Uses clustering-based pseudo-labeling and masked prediction to learn effective speech representations.* 186 | - **Whisper** [πŸ”—](https://arxiv.org/pdf/2212.04356) 187 | *A transformer-based 
model for multilingual ASR, translation, and transcription with zero-shot capabilities.* 188 | - **SeamlessM4T** [πŸ”—](https://arxiv.org/abs/2308.11596) 189 | *An end-to-end model for universal speech translation and generation that preserves speaker emotion via attention.* 190 | 191 | #### 🎭 Additional Generative Models 192 | - **GANs (Generative Adversarial Networks)** [πŸ”—](https://arxiv.org/abs/1406.2661) 193 | *An adversarial framework where a generator and discriminator engage in a minimax game to synthesize realistic data.* 194 | - **CycleGAN** [πŸ”—](https://arxiv.org/abs/1703.10593) 195 | *Enables unpaired image-to-image translation by enforcing cycle consistency between two domains.* 196 | - **Autoencoders** 197 | *A general framework that compresses input data into a latent representation and reconstructs it for unsupervised learning.* 198 | - **Variational Autoencoders (VAEs)** [πŸ”—](https://arxiv.org/abs/1312.6114) 199 | *Probabilistic autoencoders that regularize the latent space using KL divergence to generate new data samples.* 200 | - **Vector Quantized VAEs (VQ-VAEs)** [πŸ”—](https://arxiv.org/abs/1711.00937) 201 | *Enhances VAEs by discretizing the latent space with a codebook for more structured representations.* 202 | - **NeRF (Neural Radiance Fields)** [πŸ”—](https://arxiv.org/abs/2003.08934) 203 | *Learns an implicit 3D scene representation via volumetric rendering for novel view synthesis.* 204 | - **3D Gaussian Splatting (3DGS)** [πŸ”—](https://arxiv.org/abs/2308.04079) 205 | *Represents 3D scenes with a collection of Gaussian functions for efficient real-time rendering.* 206 | - **Denoising Diffusion Probabilistic Models (DDPMs)** [πŸ”—](https://arxiv.org/abs/2006.11239) 207 | *Generates high-quality outputs by iteratively denoising data from a latent space.* 208 | - **ControlNet** [πŸ”—](https://arxiv.org/abs/2302.05543) 209 | *Augments diffusion models with auxiliary conditioning inputs for precise image generation.* 210 | - **DALL-E** [πŸ”—](https://arxiv.org/abs/2102.12092) 211 | *An autoregressive transformer that generates images from text by jointly modeling text and image tokens.* 212 | 213 | --- 214 | 215 | ### πŸ“Š Metrics 216 | 217 | #### βœ… Quality and Realism of Generated Output 218 | 219 | These metrics assess how natural, realistic, and perceptually convincing the generated content appears. 220 | 221 | | **Metric** | **Description** | **Formula** | 222 | |------------|-----------------|-------------| 223 | | **FrΓ©chet Inception Distance (FID)** | Measures statistical distance between real and generated images. | $\text{FID} = \lVert \mu_r - \mu_g \rVert^2 + \text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$ | 224 | | **CLIP Score** | Evaluates semantic similarity between generated images and textual descriptions. | $\text{CLIPScore} = \frac{t \cdot i}{\lVert t \rVert \lVert i \rVert}$ | 225 | | **Mean Squared Error (MSE)** | Measures pixel-wise difference between generated and real images. | $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2$ | 226 | | **Learned Perceptual Image Patch Similarity (LPIPS)** | Assesses perceptual similarity using deep feature embeddings. | $\text{LPIPS}(x,y) = \sum_l \frac{1}{H_l W_l} \sum_{h=1}^{H_l}\sum_{w=1}^{W_l}\lVert \phi_l(x)^{h,w}-\phi_l(y)^{h,w} \rVert_2^2$ | 227 | | **Identity Consistency** | Ensures identity preservation in generated faces by computing cosine similarity. 
| $\text{IC} = \frac{1}{N}\sum_{i=1}^{N} \text{cosine-sim}\Bigl(f(x_i), f(y_i)\Bigr)$ | 228 | | **FrΓ©chet Gesture Distance (FGD)** | Measures statistical differences between real and generated gesture distributions. | $\text{FGD} = \lVert \mu_{\text{real}} - \mu_{\text{gen}} \rVert^2 + \text{tr}(\Sigma_{\text{real}} + \Sigma_{\text{gen}} - 2(\Sigma_{\text{real}}\Sigma_{\text{gen}})^{1/2})$ | 229 | | **CLIP FrΓ©chet Inception Distance (CLIP FID)** | A CLIP-based extension of FID for assessing generated textures. | $\text{CLIPFID} = \lVert \mu_{\text{CLIP,real}} - \mu_{\text{CLIP,gen}} \rVert^2 + \text{tr}(\Sigma_{\text{CLIP,real}} + \Sigma_{\text{CLIP,gen}} - 2(\Sigma_{\text{CLIP,real}} \Sigma_{\text{CLIP,gen}})^{1/2})$ | 230 | 231 | 232 | 233 | --- 234 | 235 | #### πŸ”„ Diversity and Multimodality 236 | 237 | These metrics assess whether the generative model produces diverse and varied outputs. 238 | 239 | 240 | | **Metric** | **Description** | **Formula** | 241 | |------------|-----------------|-------------| 242 | | **Diversity** | Quantifies variation between independently sampled subsets of generated outputs. | $\text{Diversity} = \frac{1}{N}\sum_{i=1}^{N}\lVert x_i - x'_i \rVert^2$ | 243 | | **Multimodality** | Measures diversity of outputs within the same action class. | $\text{Multimodality} = \frac{1}{C \cdot N}\sum_{c=1}^{C}\sum_{n=1}^{N}\lVert x_{c,n} - x'_{c,n} \rVert^2$ | 244 | | **Average Pairwise Distance (APD)** | Evaluates diversity across generated samples. | $\text{APD} = \frac{1}{N(N-1)}\sum_{i\neq j} \lVert x_i - x_j \rVert$ | 245 | 246 | --- 247 | 248 | #### 🎯 Relevance and Accuracy 249 | 250 | These metrics assess how well the generated content aligns with ground truth data. 251 | 252 | 253 | | **Metric** | **Description** | **Formula** | 254 | |------------|-----------------|-------------| 255 | | **Mean Absolute Joint Error (MAJE)** | Measures positional accuracy of generated motion. | $\text{MAJE} = \frac{1}{n}\sum_{i=1}^{n}\lvert x_i - y_i \rvert$ | 256 | | **Probability of Correct Keypoints (PCK)** | Evaluates the percentage of correct keypoint predictions. | $\text{PCK} = \frac{\text{number of correct keypoints}}{\text{number of total keypoints}}$ | 257 | | **Beat Consistency (BC)** | Measures alignment between motion and speech rhythms. | $\text{BC} = \frac{1}{T}\sum_{t=1}^{T}\cos\bigl(\text{motion-beats}(t), \text{speech-beats}(t)\bigr)$ | 258 | | **CLIP-Var** | Quantifies texture consistency across different views. | $\text{CLIP-Var} = 1 - \min_{i \neq j}\frac{f_i \cdot f_j}{\lVert f_i \rVert \lVert f_j \rVert}$ | 259 | | **Multimodal Distance (MM-Distance)** | Measures alignment between generated motion and textual descriptions. | $\text{MM-Distance} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\lVert f_{a,n} - f_{b,n} \rVert^2}$ | 260 | 261 | 262 | 263 | --- 264 | 265 | #### πŸƒ Physical Plausibility and Interaction 266 | 267 | These metrics assess whether generated motion adheres to real‑world physical constraints. 268 | 269 | | **Metric** | **Description** | **Formula** | 270 | |------------|-----------------|-------------| 271 | | **Foot Skating (FS)** | Detects unnatural foot movements in generated motion. | $\text{FS} = \frac{1}{T}\sum_{t=1}^{T}\lVert \text{foot-velocity}(t) - \text{expected-velocity}(t) \rVert$ | 272 | | **Mean Acceleration Difference (MAD)** | Evaluates smoothness of generated motion by comparing acceleration. 
| $\text{MAD} = \frac{1}{n}\sum_{i=1}^{n}\lVert a_i^{\text{gen}} - a_i^{\text{gt}} \rVert^2$ | 273 | 274 | 275 | --- 276 | 277 | #### ⚑️ Efficiency and Computational Metrics 278 | 279 | These metrics evaluate the computational cost of generative models. 280 | 281 | | **Metric** | **Description** | **Formula** | 282 | |------------|-----------------|-------------| 283 | | **Execution Time** | Measures the time required to generate outputs. | $\text{Execution Time} = \text{End Time} - \text{Start Time}$ | 284 | | **Kernel Inception Distance (KID)** | Measures output similarity using kernel functions. | $\text{KID} = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{m(m-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} k(x_i, y_j)$ 285 | | 286 | 287 | --- 288 | 289 | ## πŸ‘¨ Face 290 | Focuses on realistic face generation, facial reenactment, and attribute editing using GANs, diffusion models, and specialized frameworks. 291 | 292 | ### πŸ—‚ Datasets 293 | 294 | | 🏷️ Name | πŸ“Š Statistics | πŸ” Modalities | πŸ”— Link | 295 | | --- | --- | --- | --- | 296 | | RaFD | More than 8,000 images. Images of 67 models displaying eight facial expressions, photographed from five different angles. | πŸ–ΌοΈ Images | [RaFD](https://rafd.socsci.ru.nl/) | 297 | | MPIE | Over 750,000 images with a broad range of variations in facial expressions, head poses, and lighting conditions. | πŸ–ΌοΈ Images | [MPIE](https://www.cs.cmu.edu/afs/cs/project/PIE/MultiPie/Multi-Pie/Home.html) | 298 | | VoxCeleb1 | More than 100,000 utterances from 1,251 celebrities. | πŸ”Š Audio, πŸŽ₯ Video | [VoxCeleb1](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) | 299 | | VoxCeleb2 | Over 1 million utterances from 6,112 celebrities. | πŸ”Š Audio, πŸŽ₯ Video | [VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) | 300 | | CelebA-HQ | 30,000 images at a resolution of 1024Γ—1024, providing detailed facial images of celebrities. | πŸ–ΌοΈ Images | [CelebA-HQ](https://opendatalab.com/OpenDataLab/CelebA-HQ) | 301 | | FaceForensics | Over 1,000 video sequences with various face manipulations. | πŸŽ₯ Video | [FaceForensics](https://justusthies.github.io/posts/faceforensics/) | 302 | | 300-VW | About 300 videos of faces in various scenarios and lighting conditions. | πŸŽ₯ Video | [300-VW](https://ibug.doc.ic.ac.uk/resources/300-VW/) | 303 | | FFHQ | 70,000 images with extensive diversity, capturing various facial features, accessories, and environments. | πŸ–ΌοΈ Images | [FFHQ](https://www.computer.org/csdl/journal/tp/2021/12/08977347/1h2AHNHb9bW) | 304 | | AffectNet | Over 1 million images collected from the internet, with annotations for 11 different facial expressions and emotions. | πŸ–ΌοΈ Images | [AffectNet](http://mohammadmahoor.com/affectnet/) | 305 | | MΒ³ CelebA | Over 150K facial images annotated with semantic segmentation, facial landmarks, and captions in multiple languages. | πŸ–ΌοΈ Images, πŸ“ Text | [MΒ³ CelebA](https://huggingface.co/datasets/m3face/M3CelebA/viewer) | 306 | | CUB | Over 11,000 images of 200 bird species, each annotated with various attributes like species, part locations, and bounding boxes. | πŸ–ΌοΈ Images | [CUB](https://www.vision.caltech.edu/datasets/cub_200_2011/) | 307 | | CelebA-Dialog | 202,599 face images from 10,177 identities, annotated with 5 fine-grained attributes: Bangs, Eyeglasses, Beard, Smiling, Age, along with captions and user editing requests. 
| πŸ–ΌοΈ Images, πŸ“ Text | [CelebA-Dialog](https://mmlab.ie.cuhk.edu.hk/projects/CelebA/CelebA_Dialog.html) | 308 | | LS3D-W | A dataset of 230,000 3D facial landmarks. | πŸ–ΌοΈ Images | [LS3D-W](https://www.adrianbulat.com/face-alignment) | 309 | | MERL-RAV | Over 19,000 face images with diverse head pose, all annotated by 68 point landmarks and visibility status. | πŸ”Š Audio, πŸŽ₯ Video | [MERL-RAV](https://github.com/abhi1kumar/MERL-RAV_dataset) | 310 | | AFLW2000-3D | Contains 2000 images with 68-point 3D facial landmarks, used to evaluate 3D facial landmark detection models with diverse head poses. | πŸ–ΌοΈ Images, πŸ”· 3D/Point Cloud Data | [AFLW2000-3D](https://github.com/tensorflow/datasets/blob/master/docs/catalog/aflw2k3d.md) | 311 | | FaceScape | Over 18K textured 3D faces, captured from 938 subjects, each with 20 specific expressions. | πŸ”· 3D/Point Cloud Data | [FaceScape](https://facescape.nju.edu.cn/) | 312 | 313 | --- 314 | 315 | ### πŸ€– Models 316 | 317 | - **StyleGAN** [πŸ”—](https://arxiv.org/abs/1812.04948) 318 | *A generative adversarial network known for producing high-quality, photorealistic images. It serves as a backbone for many face generation and editing tasks.* 319 | 320 | - **ResNet** [πŸ”—](https://arxiv.org/abs/1512.03385) 321 | *A convolutional neural network architecture that provides robust feature extraction, often used as a backbone in face generation pipelines.* 322 | 323 | - **Dual-Generator (DG)** [πŸ”—](https://openaccess.thecvf.com/content/CVPR2022/papers/Hsu_Dual-Generator_Face_Reenactment_CVPR_2022_paper.pdf) 324 | *A large-pose face reenactment model composed of two modules: the ID-Preserving Shape Generator (IDSG), which uses 3D landmark detection to capture local shape variations, and the Reenacted Face Generator (RFG), based on StarGAN2, to produce the final output.* 325 | 326 | - **Feature Disentanglement and Identity Transfer Model** [πŸ”—](https://www.sciencedirect.com/science/article/abs/pii/S002002552200682X) 327 | *An approach that bypasses the need for pre-trained structural priors by using a Feature Disentanglement module with Feature Displacement Fields (FDF) and an Identity Transfer (IdT) module based on self-attention to align source identity with target attributes.* 328 | 329 | - **Unified Neural Face Reenactment Pipeline** [πŸ”—](https://openaccess.thecvf.com/content/ACCV2020/papers/Le_Minh_Ngo_Unified_Application_of_Style_Transfer_for_Face_Swapping_and_Reenactment_ACCV_2020_paper.pdf) 330 | *A pipeline that leverages a 3D shape model to obtain disentangled representations of pose, expression, and identity, mapping changes in these parameters to the latent space of a fine-tuned StyleGAN2 for accurate face reenactment.* 331 | 332 | - **Controllable 3D Generative Adversarial Face Model** [πŸ”—](https://arxiv.org/abs/2208.14263) 333 | *A model that employs a Supervised Auto-Encoder (SAE) to disentangle identity and expression into separate latent spaces, using a Conditional GAN (cGAN) for smooth and controllable expression intensity.* 334 | 335 | - **AlbedoGAN** [πŸ”—](https://openaccess.thecvf.com/content/WACV2024/papers/Rai_Towards_Realistic_Generative_3D_Face_Models_WACV_2024_paper.pdf) 336 | *A self-supervised 3D generative face model that synthesizes high-resolution albedo and detailed 3D geometry. 
It refines facial textures (e.g., wrinkles) via a mesh refinement displacement map integrated with the FLAME model, and leverages CLIP for text-guided editing.* 337 | 338 | - **IricGAN (Information Retention and Intensity Control GAN)** [πŸ”—](https://www.researchgate.net/publication/361317388_Face_editing_based_on_facial_recognition_features) 339 | *A face editing method designed to preserve identity and semantic details while enabling controlled modifications of facial attributes. It features a Hierarchical Feature Combination (HFC) module and an Attribute Regression Module (ARM) for smooth intensity control.* 340 | 341 | - **GSmoothFace** [πŸ”—](https://arxiv.org/abs/2312.07385) 342 | *A speech-driven talking face generation framework based on fine-grained 3D face modeling. It addresses lip synchronization and generalizability across speakers by introducing bias-based cross-attention and a Morphology Augmented Face Blending (MAFB) module.* 343 | 344 | - **Adaptive Latent Editing Model** [πŸ”—](https://arxiv.org/abs/2307.07790) 345 | *A face editing approach that uses adaptive and nonlinear latent space transformations to flexibly learn transformations for complex, conditional edits while maintaining image quality and realism.* 346 | 347 | - **StyleT2I** [πŸ”—](https://arxiv.org/abs/2203.15799) 348 | *A text-to-image synthesis model that improves compositionality and fidelity. It uses a CLIP-guided Contrastive Loss and a Text-to-Direction module to align StyleGAN’s latent codes with text descriptions, enhancing attribute control.* 349 | 350 | - **Hybrid Neural-Graphics Face Generation Model** [πŸ”—](https://dl.acm.org/doi/10.1145/3588432.3591563) 351 | *A model that combines neural networks (using StyleGAN2 for texture and background synthesis) with fixed-function graphics components (such as a differentiable renderer and the FLAME 3D head model) to achieve interpretable control over facial attributes.* 352 | 353 | - **M3Face** [πŸ”—](https://arxiv.org/abs/2402.02369) 354 | *A framework leveraging multimodal and multilingual inputs for both face generation and editing. It uses the Muse model to generate segmentation masks or landmarks from text and applies ControlNet architectures to refine the results, streamlining the process into a single step.* 355 | 356 | - **GuidedStyle** [πŸ”—](https://arxiv.org/abs/2012.11856) 357 | *A framework for semantic face editing on StyleGAN that employs a pre-trained attribute classifier as a knowledge network and sparse attention to guide layer-specific modifications, ensuring that only targeted facial features are changed.* 358 | 359 | - **AnyFace** [πŸ”—](https://arxiv.org/abs/2203.15334) 360 | *The first free-style text-to-face synthesis model capable of handling open-world text descriptions. It features a two-stream architecture that decouples text-to-face generation from face reconstruction, using CLIP-based cross-modal distillation and a Diverse Triplet Loss to enhance alignment and diversity.* 361 | 362 | - **HiFace** [πŸ”—](https://arxiv.org/abs/2303.11225) 363 | *A 3D face reconstruction model that decouples static (e.g., skin texture) and dynamic (e.g., wrinkles) details using its SD-DeTail Module. It extracts shape and detail coefficients via ResNet-50 and uses MLPs with AdaIN to generate detailed displacement maps for realistic reconstructions and animations.* 364 | 365 | --- 366 | 367 | ## πŸ˜ƒ Expression 368 | Covers emotion-driven synthesis, facial expression retargeting, and multimodal methods that capture nuanced nonverbal cues. 
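
Most of the datasets and models in this section represent a facial expression as a low-dimensional vector of blendshape weights (for example, the 52D facial blend shape weights in BEAT below). A minimal sketch of the linear blendshape model that such weights parameterize is shown here; the vertex count, number of channels, and weight values are illustrative assumptions, not the convention of any specific dataset.

```python
import numpy as np

# Minimal linear blendshape sketch: a neutral face mesh plus a weighted sum of
# per-expression vertex offsets. All shapes and values below are placeholders.
V = 5023                                          # hypothetical vertex count
neutral = np.zeros((V, 3))                        # neutral face vertices (placeholder)
blendshapes = 0.01 * np.random.randn(52, V, 3)    # 52 expression deltas (placeholder)

def apply_blendshapes(weights: np.ndarray) -> np.ndarray:
    """Return deformed vertices for a 52-D blendshape weight vector."""
    assert weights.shape == (52,)
    return neutral + np.tensordot(weights, blendshapes, axes=1)   # (V, 3)

w = np.zeros(52)
w[10] = 0.3                          # partially activate one (hypothetical) channel
print(apply_blendshapes(w).shape)    # (5023, 3)
```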
369 | 370 | ### πŸ—‚ Datasets 371 | 372 | | 🏷️ Name | πŸ“Š Statistics | πŸ” Modalities | πŸ”— Link | 373 | | --- | --- | --- | --- | 374 | | BEAT | 76 hours of speech data, paired with 52D facial blend shape weights; 30 speakers performing in 8 distinct emotional styles across 4 languages. | πŸ”Š Audio, πŸ–ΌοΈ Images, πŸŽ₯ Video, πŸ“ Text | [BEAT](https://pantomatrix.github.io/BEAT/) | 375 | | MEAD | A talking-face video corpus featuring 60 actors and actresses talking with eight different emotions at three intensity levels; approximately 40 hours of audio-visual clips per person and view. | πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text, πŸ–ΌοΈ Images | [MEAD](https://wywu.github.io/projects/MEAD/MEAD.html) | 376 | | TEAD | 50,000 quadruples, each including text, emotion tags, Action Units, blend shape weights, and situation sentences. | πŸ“ Text, πŸ–ΌοΈ Images | - | 377 | | JAFFE | 213 images of 10 Japanese female models posing 7 facial expressions, annotated with average semantic ratings from 60 annotators. | πŸ–ΌοΈ Images | [JAFFE](https://zenodo.org/records/3451524) | 378 | | MMI Facial Expression | Over 2900 videos and high-resolution still images of 75 subjects. | πŸŽ₯ Video, πŸ–ΌοΈ Images, πŸ“ Text | [MMI](https://mmifacedb.eu/) | 379 | | Multiface | High-quality recordings of the faces of 13 identities. An average of 23,000 frames per subject; each frame includes roughly 160 different camera views. | πŸ–ΌοΈ Images, πŸ”Š Audio, πŸ“‹ Tabular Data | [Multiface](https://github.com/facebookresearch/multiface) | 380 | | ICT FaceKit | 4,000 high-resolution facial scans of 79 subjects (34 female, 45 male) aged 18–67, plus 99 full-head scans and 26 expressions per subject. | πŸ”· 3D/Point Cloud Data, πŸ–ΌοΈ Images | [ICT FaceKit](https://github.com/ICT-VGL/ICT-FaceKit) | 381 | | TikTok Dataset | Over 300 single-person dance videos (10–15 seconds each), extracted at 30fps, yielding 100K+ frames. Includes segmented images and computed UV coordinates. | πŸŽ₯ Video, πŸ–ΌοΈ Images, πŸ“‹ Tabular Data | [TikTok Dataset](https://www.yasamin.page/hdnet_tiktok#h.jr9ifesshn7v) | 382 | | Everybody Dance Now | Long single-dancer videos for training and evaluation; includes both self-filmed videos and short YouTube videos. | πŸŽ₯ Video, πŸ“‹ Tabular Data | [Everybody Dance Now](https://carolineec.github.io/everybody_dance_now/) | 383 | | Obama Weekly Footage | 17 hours of video footage, nearly two million frames, spanning eight years. | πŸŽ₯ Video, πŸ”Š Audio | [Obama Weekly Footage](https://grail.cs.washington.edu/projects/AudioToObama/) | 384 | | VoxCeleb2 | Over 1 million utterances from over 6,000 speakers, collected from YouTube videos with 61% male speakers. | πŸ”Š Audio, πŸŽ₯ Video | [VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html) | 385 | | BIWI | Over 15K images of 20 people recorded with a Kinect while turning their heads around freely. | πŸ”· 3D/Point Cloud Data, πŸ–ΌοΈ Images, πŸ“‹ Tabular Data, πŸ“ Text | [BIWI](https://paperswithcode.com/dataset/biwi-kinect-head-pose) | 386 | | VOCASET | About 29 minutes of high-fidelity 4D scans captured at 60fps, synchronized with audio; features 12 speakers with 40 sequences per subject (each sequence consists of English sentences lasting 3–5 seconds). | πŸ”· 3D/Point Cloud Data, πŸ”Š Audio | [VOCASET](https://voca.is.tue.mpg.de/) | 387 | | SHOW | Contains SMPLX parameters of 4 persons reconstructed from videos; includes 88-frame motion clips for training and validation. 
| πŸŽ₯ Video, πŸ”Š Audio, πŸ–ΌοΈ Images, πŸ“‹ Tabular Data | [SHOW](https://github.com/yhw-yhw/SHOW) | 388 | 389 | 390 | --- 391 | 392 | ### πŸ€– Models 393 | 394 | #### πŸŽ™ Speech-Driven & Multimodal Expression Generation 395 | 396 | - **Joint Audio-Text Model for 3D Facial Animation** [πŸ”—](https://arxiv.org/abs/2112.02214) 397 | *Integrates a GPT-2-based text encoder with a dilated convolution audio encoder to improve upper-face expressiveness and lip synchronization. Lacks head and gaze control.* 398 | 399 | - **VOCA** [πŸ”—](https://arxiv.org/abs/1905.03079) 400 | *A speech-driven facial animation model used as a baseline for lip synchronization and expressiveness.* 401 | 402 | - **MeshTalk** [πŸ”—](https://arxiv.org/abs/2104.08223) 403 | *A model for speech-driven 3D facial animation, serving as a comparison baseline for upper-face motion and expressiveness.* 404 | 405 | - **CSTalk** [πŸ”—](https://arxiv.org/abs/2404.18604) 406 | *Employs a transformer-based encoder to capture correlations across facial regions, enhancing emotional speech-driven animation; limited to five discrete emotions.* 407 | 408 | - **ExpCLIP** [πŸ”—](https://arxiv.org/abs/2308.14448) 409 | *Aligns text, image, and expression embeddings via CLIP encoders, enabling expressive speech-driven facial animation from text/image prompts by leveraging the TEAD dataset and Expression Prompt Augmentation.* 410 | 411 | - **Style-Content Disentangled Expression Model** [πŸ”—](https://arxiv.org/abs/2412.14496) 412 | *Enhances personalization in facial animation by disentangling style and content representations, thereby improving identity retention and transition smoothness. (Compared to FaceFormer.)* 413 | 414 | - **FaceFormer** [πŸ”—](https://arxiv.org/abs/2112.05329) 415 | *A speech-driven facial animation model noted for its audio-visual synchronization, used as a baseline for comparison.* 416 | 417 | - **AdaMesh** [πŸ”—](https://arxiv.org/abs/2310.07236) 418 | *Introduces an Expression Adapter (MoLoRA-enhanced) and a Pose Adapter (retrieval-based) for personalized speech-driven facial animation, achieving improved expressiveness, diversity, and synchronization compared to models such as GeneFace and Imitator.* 419 | 420 | - **FaceXHuBERT** [πŸ”—](https://galib360.github.io/FaceXHuBERT/) 421 | *Explores disentangling emotional expressiveness through multimodal representations as part of advanced speech-driven facial animation.* 422 | 423 | - **FaceDiffuser** [πŸ”—](https://arxiv.org/abs/2309.11306) 424 | *Utilizes stochastic approaches to enhance motion variability and disentangle emotional expressiveness in facial animation.* 425 | 426 | 427 | #### πŸ” Expression Retargeting & Motion Transfer 428 | 429 | - **Neural Face Rigging (NFR)** [πŸ”—](https://arxiv.org/abs/2305.08296) 430 | *Automates 3D mesh rigging by encoding interpretable deformation parameters, enabling fine-grained facial expression transfer.* 431 | 432 | - **MagicPose** [πŸ”—](https://arxiv.org/abs/2311.12052) 433 | *Leverages diffusion models for 2D facial expression retargeting, balancing identity preservation and motion control through Multi-Source Attention and Pose ControlNet.* 434 | 435 | - **DiffSHEG** [πŸ”—](https://arxiv.org/abs/2401.04747) 436 | *Pioneers joint 3D facial expression and gesture synthesis with speech-driven alignment, employing Fast Out-Painting-based Partial Autoregressive Sampling (FOPPAS) for seamless, real-time motion generation.* 437 | 438 | - **DreamPose** [πŸ”—](https://grail.cs.washington.edu/projects/dreampose/) 439 | 
*A baseline model for 2D facial expression retargeting used for comparison with MagicPose.* 440 | 441 | - **Disco** [πŸ”—](https://arxiv.org/abs/2307.00040) 442 | *Serves as a comparison baseline in 2D facial expression retargeting, noted for its identity retention and generalization capabilities.* 443 | 444 | - **TalkSHOW** [πŸ”—](https://talkshow.is.tue.mpg.de/) 445 | *A speech-driven facial animation model referenced as a baseline for comparison with DiffSHEG.* 446 | 447 | - **LS3DCG** [πŸ”—](https://openaccess.thecvf.com/content/CVPR2024/papers/Chen_DiffSHEG_A_Diffusion-Based_Approach_for_Real-Time_Speech-driven_Holistic_3D_Expression_CVPR_2024_paper.pdf) 448 | *A model for 3D facial expression and gesture synthesis used as a baseline when comparing motion realism and synchronization.* 449 | 450 | - **DiffuseStyleGesture** [πŸ”—](https://arxiv.org/abs/2305.04919) 451 | *Referenced as a baseline model for facial expression and gesture synthesis in comparison to DiffSHEG.* 452 | 453 | --- 454 | 
455 | ## πŸ–Ό Image 456 | Explores diffusion-based methods, VAEs, and other generative techniques to produce high-fidelity images and textures for animation backgrounds and elements. 457 | 458 | ### πŸ—‚ Datasets 459 | 460 | | 🏷️ Name | πŸ“Š Statistics | πŸ” Modalities | πŸ”— Link | 461 | | --- | --- | --- | --- | 462 | | LAION-5B | 5.85 billion CLIP-filtered image-text pairs | πŸ–ΌοΈ Images, πŸ“ Text | [LAION-5B](https://laion.ai/blog/laion-5b/) | 463 | | LAION-400M | 400M English (image, text) pairs | πŸ–ΌοΈ Images, πŸ“ Text | [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) | 464 | | LAION-Aesthetics v2 | 1.2B image-text pairs with aesthetics scores β‰₯4.5<br>939M pairs with scores β‰₯4.75<br>600M pairs with scores β‰₯5<br>12M pairs with scores β‰₯6<br>3M pairs with scores β‰₯6.25<br>625K pairs with scores β‰₯6.5 | πŸ–ΌοΈ Images, πŸ“ Text | [LAION-Aesthetics v2](https://laion.ai/blog/laion-aesthetics/) | 465 | | Open Images V7 | 9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives | πŸ–ΌοΈ Images | [Open Images V7](https://storage.googleapis.com/openimages/web/index.html) | 466 | | COYO | 747M image-text pairs | πŸ–ΌοΈ Images, πŸ“ Text | [COYO](https://github.com/kakaobrain/coyo-dataset) | 467 | | Conceptual Captions | 3.3M images annotated with captions | πŸ–ΌοΈ Images, πŸ“ Text | [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/) | 468 | | COCO | 330K images (>200K labeled)<br>1.5 million object instances<br>80 object categories<br>91 stuff categories<br>5 captions per image<br>250,000 people with key points | πŸ–ΌοΈ Images, πŸ“ Text | [COCO](https://cocodataset.org) | 469 | | ShareGPT | 100K highly descriptive image-caption pairs | πŸ–ΌοΈ Images, πŸ“ Text | [ShareGPT](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V) | 470 | | ADE20K | 20,210 images in the training set<br>2,000 images in the validation set<br>3,000 images in the testing set | πŸ–ΌοΈ Images | [ADE20K](https://groups.csail.mit.edu/vision/datasets/ADE20K/) | 471 | 472 | 473 | --- 474 | 
475 | ### πŸ€– Models 476 | 477 | 478 | #### πŸ”§ Fine-Tuning & Regularization 479 | 480 | - **Spectral Shift Fine-Tuning** [πŸ”—](https://arxiv.org/abs/2305.18670) 481 | *Introduces a compact parameter space called β€œspectral shift” for diffusion model fine-tuning. It reduces overfitting and storage inefficiency while achieving comparable or superior results in both single- and multi-subject generation. The method also employs the Cut-Mix-Unmix data augmentation technique for improved multi-subject quality and acts as a regularizer enabling applications like single-image editing.* 482 | 483 | - **Control via Zero Convolutions (ControlNet)** [πŸ”—](https://arxiv.org/abs/2302.05543) 484 | *Addresses the limited spatial control of text-to-image models by locking large pre-trained diffusion models and reusing their deep encoding layers as a robust backbone. Connected via β€œzero convolutions” (zero-initialized convolution layers), this approach progressively grows parameters from zero to prevent harmful noise during fine-tuning, thereby facilitating diverse conditional controls.* 485 | 486 | 487 | #### βœ‚ Image Editing & Disentanglement 488 | 489 | - **Lightweight Disentanglement for Image Editing** [πŸ”—](https://ieeexplore.ieee.org/document/10175586) 490 | *Explores the inherent disentanglement properties of Stable Diffusion models. By partially replacing text embeddings from a style-neutral description with one that reflects the desired style, a lightweight algorithm (optimizing only 50 parameters) is introduced for improved style matching and content preservation, outperforming more complex fine-tuning baselines.* 491 | 492 | - **SmartEdit** [πŸ”—](https://openaccess.thecvf.com/content/CVPR2024/papers/Huang_SmartEdit_Exploring_Complex_Instruction-based_Image_Editing_with_Multimodal_Large_Language_CVPR_2024_paper.pdf) 493 | *Frames image editing as a supervised learning problem by generating a paired training dataset of text editing instructions with before/after images. Built on the Stable Diffusion framework, it successfully handles challenging edits such as object replacement, seasonal changes, background modifications, and alterations of material attributes or artistic mediums.* 494 | 495 | - **Classifier-Free Guidance** [πŸ”—](https://arxiv.org/pdf/2207.12598) 496 | *Employs a modified classifier-free guidance strategy in two ways: by introducing model-based classifier-free guidance and by planting a content β€œseed” early during denoising. Coupled with a patch-based fine-tuning strategy on latent diffusion models (LDMs), this approach enables generation at arbitrary resolutions while leveraging large pre-trained models.* 497 | 498 | - **Null Embedding Optimization for High-Fidelity Reconstructions** [πŸ”—](https://null-text-inversion.github.io/) 499 | *Observes that DDIM inversion provides a good starting point but struggles with classifier-free guidance. 
By optimizing the unconditional null embedding used in classifier-free guidance, this method achieves high-fidelity reconstructions without additional tuning of the model or conditional embeddings, thereby preserving editing capabilities.* 500 | 501 | - **Unified Diffusion Model Editing Algorithm** [πŸ”—](https://unified.baulab.info/) 502 | *Follows a three-stage approach: (i) optimizing text embeddings to match a given image, (ii) fine-tuning diffusion models for improved image alignment, and (iii) linearly interpolating between optimized and target text embeddings. This unified algorithm enables precise editing of diffusion models, aiming to make them more responsible and beneficial.* 503 | 504 | - **Debiasing Text-to-Image Diffusion Models** [πŸ”—](https://arxiv.org/html/2402.14577v1) 505 | *Enables targeted debiasing, removal of potentially copyrighted content, and moderation of offensive concepts using only text descriptions. This editing methodology can be applied to any linear projection layer by replacing pre-trained weights while preserving key concepts.* 506 | 507 | #### πŸ‘½ Multimodal Conversations & Visual Understanding 508 | 509 | - **AlignGPT** [πŸ”—](https://arxiv.org/abs/2405.14129) 510 | *Comprises a multimodal large language model (MLLM) for enhanced multimodal perception. An accompanying AlignerNet bridges the MLLM to the diffusion U-Net image decoder, enabling coherent integration of textual and visual information.* 511 | 512 | - **KOSMOS-G** [πŸ”—](https://arxiv.org/abs/2310.02992) 513 | *Offers seamless concept-level guidance from interleaved input to the image decoder. Serving as an alternative to CLIP, it facilitates effective image generation by guiding the diffusion process with interleaved multimodal cues.* 514 | 515 | - **MM-REACT** [πŸ”—](https://arxiv.org/abs/2303.11381) 516 | *Presents a unified approach that synergizes multimodal reasoning and action to tackle complex visual understanding tasks. Extensive zero-shot experiments demonstrate its capabilities in multi-image reasoning, multi-hop document understanding, and open-world concept comprehension.* 517 | 518 | --- 519 | 520 | ## πŸ‘€ Avatar 521 | Reviews approaches for both 2D and 3D avatar creation, emphasizing lifelike digital representations with detailed facial expressions and body dynamics. 
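
Many of the avatar datasets and models below build on the SMPL family of parametric body models introduced in the Background section, which map shape coefficients (Ξ²) and pose-dependent joint transforms to mesh vertices. The toy sketch below illustrates that flow (shape blending followed by linear blend skinning) with placeholder arrays; it is a simplified illustration, not the actual SMPL implementation.

```python
import numpy as np

# Toy SMPL-style body model: shape blending + linear blend skinning (LBS).
# Vertex/joint counts, shape basis, and skinning weights are placeholders.
V, J = 6890, 24
template = np.zeros((V, 3))                      # mean template mesh (placeholder)
shape_dirs = 0.01 * np.random.randn(V, 3, 10)    # PCA shape basis (placeholder)
skin_weights = np.full((V, J), 1.0 / J)          # LBS weights, rows sum to 1 (placeholder)

def toy_body(beta: np.ndarray, joint_transforms: np.ndarray) -> np.ndarray:
    """beta: (10,) shape coefficients; joint_transforms: (J, 4, 4) per-joint world transforms."""
    shaped = template + shape_dirs @ beta                           # identity-dependent shape
    homo = np.concatenate([shaped, np.ones((V, 1))], axis=1)        # homogeneous coordinates
    per_joint = np.einsum('jab,vb->vja', joint_transforms, homo)    # transform by every joint
    return np.einsum('vj,vja->va', skin_weights, per_joint)[:, :3]  # blend by skinning weights

beta = np.zeros(10)                          # average shape
rest_pose = np.tile(np.eye(4), (J, 1, 1))    # identity transforms = rest pose
print(toy_body(beta, rest_pose).shape)       # (6890, 3)
```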
522 | 523 | ### πŸ—‚ Datasets 524 | 525 | | 🏷️ Name | πŸ“Š Statistics | πŸ” Modalities | πŸ”— Link | 526 | | ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------ | -------------------------------------------------------------------- | 527 | | WildAvatar | Over 10,000 human subjects; extracted from YouTube; significantly richer than previous datasets for 3D human avatar creation | πŸŽ₯ Video, πŸ”· 3D/Point Cloud Data, πŸ”Š Audio | [WildAvatar](https://wildavatar.github.io/) | 528 | | ZJU-MoCap | Multi-camera system with 20+ synchronized cameras; includes SMPL-X parameters for detailed motion capture of body, hand, and face; complex actions such as twirling, Taichi, and punching | πŸŽ₯ Video, πŸ”· 3D/Point Cloud Data | [ZJU-MoCap](https://chingswy.github.io/Dataset-Demo/) | 529 | | TalkSHOW | 26.9 hours of in-the-wild talking videos from 4 speakers; expressive 3D whole-body meshes reconstructed at 30 fps, synchronized with audio at 22 kHz | πŸ”Š Audio, πŸ”· 3D/Point Cloud Data | [TalkSHOW](https://talkshow.is.tue.mpg.de/) | 530 | | HuMMan | 1,000 human subjects, 400k sequences, 60M frames; include point clouds, SMPL parameters, and textured meshes for multimodal sensing | πŸŽ₯ Video, πŸ”· 3D/Point Cloud Data | [HuMMan](https://caizhongang.com/projects/HuMMan/) | 531 | | BUFF | 6 subjects performing motions in two clothing styles; 13,632 3D scans with high-resolution ground-truth minimally-clothed shapes | πŸ”· 3D/Point Cloud Data | [BUFF](https://buff.is.tue.mpg.de/) | 532 | | AMASS | Combines 15 motion capture datasets into a unified framework with over 42 hours of motion data; 346 subjects and 11,451 motions with SMPL pose parameters, 3D shape parameters, and soft-tissue coefficients | πŸ”· 3D/Point Cloud Data | [AMASS](https://amass.is.tue.mpg.de/) | 533 | | 3DPW | 60 video sequences with accurate 3D poses using video and IMU data; 18 re-poseable 3D body models with different clothing variations | πŸŽ₯ Video, ⏱️ Time-Series Data, πŸ”· 3D/Point Cloud Data | [3DPW](https://virtualhumans.mpi-inf.mpg.de/3DPW/) | 534 | | AIST++ | 10,108,015 frames of 3D key points with corresponding images; 1,408 dance motion sequences spanning 10 dance genres with synchronized music | πŸŽ₯ Video, πŸ”Š Audio, πŸ”· 3D/Point Cloud Data | [AIST++](https://google.github.io/aistplusplus_dataset/) | 535 | | RenderMe-360 | Over 243 million head frames from 500 identities; includes FLAME parameters, UV maps, action units, textured meshes, and diverse annotations | πŸŽ₯ Video, πŸ”· 3D/Point Cloud Data | [RenderMe-360](https://renderme-360.github.io/) | 536 | | PuzzleIOI | 41 subjects with nearly 1,000 Outfit-of-the-Day (OOTD) configurations; includes paired ground-truth 3D body scans for challenging partial photos | πŸ–ΌοΈ Images, πŸ”· 3D/Point Cloud Data, πŸ“ Text | [PuzzleIOI](https://puzzleavatar.is.tue.mpg.de/) | 537 | 538 | 539 | --- 540 | 541 | ### πŸ€– Models 542 | 543 | 544 | #### πŸ” CLIP-Guided Models 545 | 546 | - **AvatarCLIP** [πŸ”—](https://hongfz16.github.io/projects/AvatarCLIP.html) 547 | *A zero-shot framework for generating and animating 3D avatars from natural language descriptions. It uses a shape VAE for initial geometry generation guided by CLIP and integrates NeuS for high-quality geometry and photorealistic rendering. 
In the motion phase, candidate poses are selected via CLIP and a motion VAE synthesizes smooth motions.* 548 | 549 | - **DreamField** [πŸ”—](https://ajayj.com/dreamfields) 550 | *Adapts NeRF for text-driven 3D object generation. While it facilitates text-to-3D synthesis, it struggles with capturing detailed geometry.* 551 | 552 | - **Text2Mesh** [πŸ”—](https://threedle.github.io/text2mesh/) 553 | *Stylizes existing meshes using CLIP guidance. It aims for text-driven mesh modifications but faces challenges with stability and flexibility when handling diverse text descriptions.* 554 | 555 | 556 | #### 🧩 Implicit Function-Based Models 557 | 558 | - **PIFu (Pixel-Aligned Implicit Function)** [πŸ”—](https://shunsukesaito.github.io/PIFu/) 559 | *Reconstructs detailed 3D surfaces from single-view 2D images by projecting 3D points into 2D space to extract pixel-aligned features via CNNs, which are then processed by an MLP for high-resolution surface reconstructions.* 560 | 561 | - **PIFuHD** [πŸ”—](https://shunsukesaito.github.io/PIFuHD/) 562 | *Enhances PIFu by incorporating multi-scale feature extraction, leading to improved global shape understanding and finer surface details.* 563 | 564 | - **ARCH (Animatable Reconstruction of Clothed Humans)** [πŸ”—](https://arxiv.org/abs/2004.04572) 565 | *Reconstructs detailed 3D models of clothed individuals from single RGB images. It transforms poses into a canonical space using a parametric body model and employs an implicit surface representation to capture fine details such as clothing folds.* 566 | 567 | - **ARCH++** [πŸ”—](https://arxiv.org/abs/2108.07845) 568 | *An enhanced version of ARCH that refines geometry encoding and boosts clothing details to produce photorealistic, animatable avatars.* 569 | 570 | - **PaMIR (Parametric Model-Conditioned Implicit Representation)** [πŸ”—](https://arxiv.org/abs/2007.03858) 571 | *Combines a parametric SMPL body model with an implicit surface representation to reconstruct 3D humans from single RGB images. It uses a depth-ambiguity-aware loss and refines SMPL parameters during inference for better alignment.* 572 | 573 | - **TADA (Text to Animatable Dynamic Avatar)** [πŸ”—](https://tada.is.tue.mpg.de/) 574 | *Generates high-fidelity, animatable 3D avatars directly from text prompts. It leverages an upsampled SMPL-X model and learnable displacements, optimizing geometry and texture via Score Distillation Sampling losses, with additional detail enhancement through partial mesh subdivision.* 575 | 576 | - **GETAvatar (Generative Textured Meshes for Animatable Human Avatars)** [πŸ”—](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhang_GETAvatar_Generative_Textured_Meshes_for_Animatable_Human_Avatars_ICCV_2023_paper.pdf) 577 | *Directly produces high-fidelity, explicitly textured 3D meshes. It represents human bodies using an articulated 3D mesh and generates a signed distance field (SDF) in canonical space, which is deformed to match the target shape and pose via SMPL-based transformations. A normal field trained on 3D scans enhances fine geometric details.* 578 | 579 | - **RodinHD** [πŸ”—](https://rodinhd.github.io/) 580 | *Creates 3D avatars from a single portrait image by constructing a detailed 3D blueprint (triplane) that captures the avatar’s shape, textures, and fine details. 
A shared neural decoder then converts this blueprint into an image, with a cascaded diffusion model generating new triplanes based on the portrait.* 581 | 582 | #### πŸŽ₯ NeRF-Based Methods 583 | 584 | - **HumanNeRF** [πŸ”—](https://grail.cs.washington.edu/projects/humannerf/) 585 | *Pioneers the use of deformation fields for dynamic human models from monocular images, enabling the mapping of points from observation to canonical space.* 586 | 587 | - **Neural Body** [πŸ”—](https://zju3dv.github.io/neuralbody/) 588 | *Introduces structured latent codes anchored to SMPL model vertices, processed via SparseConvNet, to regularize dynamic human modeling.* 589 | 590 | - **Neural Human Performer** [πŸ”—](https://youngjoongunc.github.io/nhp/) 591 | *Captures dynamic human information directly in the observation space using a skeletal feature bank and transformer modules.* 592 | 593 | - **Vid2Avatar** [πŸ”—](https://moygcc.github.io/vid2avatar/) 594 | *Jointly models human subjects and scene backgrounds using two separate neural radiance fields, enhancing realism in avatar generation.* 595 | 596 | - **DreamHuman** [πŸ”—](https://dream-human.github.io/) 597 | *Generates animatable 3D human avatars from textual descriptions by combining NeRF with the imGHUM body model. It uses human body shape statistics for anatomical correctness and incorporates semantic zooming for detailed regions such as faces and hands.* 598 | 599 | 600 | #### 🌈 Diffusion-Based Methods 601 | 602 | - **Personalized Avatar Scene (PAS)** [πŸ”—] 603 | *Generates customized 3D avatars in various poses and scenes based on text descriptions. It employs a diffusion-based transformer to generate 3D body poses conditioned on text.* 604 | 605 | - **3D Head Avatar via 3DMM & Diffusion** [πŸ”—](https://arxiv.org/abs/2307.04859) 606 | *Combines a parametric 3D Morphable Model of the head (using FLAME [153]) with diffusion models to jointly optimize geometry and texture for generating 3D head avatars from text prompts.* 607 | 608 | - **Make-Your-Anchor** [πŸ”—](https://arxiv.org/abs/2403.16510) 609 | *Introduces a novel approach for generating 2D anchor-style avatars capable of realistic full-body motion and expression. It utilizes a Structure-Guided Diffusion Model (SGDM) to ensure coherent and expressive avatar generation.* 610 | 611 | 612 | #### πŸ”€ Hybrid Methods 613 | 614 | - **DreamAvatar** [πŸ”—](https://yukangcao.github.io/DreamAvatar/) 615 | *Integrates shape priors, diffusion models, and NeRF architecture within a dual-observation-space (DOS) framework. Leveraging SMPL for anatomical guidance and employing joint optimization with specialized head-focused VSD loss (using ControlNet [310]), it ensures structurally consistent avatars with controllable shape modifications. While it outperforms methods like DreamWaltz [111] in geometric accuracy, it currently lacks animation capabilities and may inherit biases from pretrained diffusion models.* 616 | 617 | - **DreamWaltz** [πŸ”—](https://idea-research.github.io/DreamWaltz/) 618 | *Referenced as a comparative baseline, this model illustrates limitations in animation capabilities and inherited biases when compared to hybrid approaches like DreamAvatar.* 619 | 620 | --- 621 | 622 | ## 🀝 Gesture 623 | Examines methods for generating human-like gestures and co-speech movements, critical for interactive and immersive animations. 
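
Co-speech gesture models in this section are most often scored with the FrΓ©chet Gesture Distance (FGD) defined in the Metrics section above. A small sketch of that computation on pre-extracted gesture features is given below; the feature encoder is assumed to exist, and the random inputs are placeholders for its outputs.

```python
import numpy as np
from scipy import linalg

def frechet_gesture_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """FGD between real and generated gesture feature sets of shape (N, D)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)   # matrix square root
    covmean = covmean.real                                     # drop numerical imaginary parts
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Placeholder features standing in for a pretrained gesture encoder's outputs.
real = np.random.randn(500, 64)
fake = np.random.randn(500, 64) + 0.1
print(frechet_gesture_distance(real, fake))
```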
624 | 625 | ### πŸ—‚ Datasets 626 | 627 | | 🏷️ Name | πŸ“Š Statistics | πŸ” Modalities | πŸ”— Link | 628 | | ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | 629 | | IEMOCAP | 151 recorded dialogue videos, with 2 speakers per session, totaling 302 videos. Annotated for 9 emotions and valence, arousal, and dominance. Contains approximately 12 hours of audiovisual data. | πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text, πŸ“‹ Tabular Data | [IEMOCAP](https://sail.usc.edu/iemocap/) | 630 | | SaGA | 25 dialogues between interlocutors (50 total). Language: German. Published speakers: 6, unpublished speakers: 19. Annotated gestures: 1,764 (total corpus). Total video duration: 1 hour. | πŸŽ₯ Video, πŸ”Š Audio, πŸ“‹ Tabular Data | [SaGA](https://www.phonetik.uni-muenchen.de/Bas/BasSaGAeng.html) | 631 | | Creative-IT | Data from 16 actors (male and female). Affective dyadic interactions range from 2 to 10 minutes each. Approximately 8 sessions of audiovisual data were released. | πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text, πŸ“‹ Tabular Data | [CreativeIT](https://sail.usc.edu/CreativeIT/ImprovRelease.html) | 632 | | CMU Panoptic | 3D facial landmarks from 65 sequences (5.5 hours). Contains 1.5 million 3D skeletons. | πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text | [CMU Panoptic](http://domedb.perception.cs.cmu.edu/) | 633 | | Speech-Gesture | A 144-hour dataset featuring 10 speakers. Includes frame-by-frame, automatically detected pose annotations. | πŸŽ₯ Video, πŸ”Š Audio | [Speech-Gesture](https://people.eecs.berkeley.edu/~shiry/projects/speech2gesture/) | 634 | | Talking With Hands 16.2M | 16.2 million frames (50 hours) of two-person, face-to-face spontaneous conversations. Strong covariance in arm and hand features. | πŸŽ₯ Video, πŸ”Š Audio | [Talking With Hands](https://github.com/facebookresearch/TalkingWithHands32M) | 635 | | PATS | 25 speakers, 251 hours of data, approximately 84,000 intervals. Mean interval length: 10.7 seconds. | πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text | [PATS](http://chahuja.com/pats/) | 636 | | Trinity Speech-Gesture II | 244 minutes of motion capture and audio (23 takes). Includes one male native English speaker. The skeleton consists of 69 joints. | πŸŽ₯ Video, πŸ”Š Audio, πŸ“‹ Tabular Data | [Trinity](https://trinityspeechgesture.scss.tcd.ie/) | 637 | | SaGA++ | 25 recordings, totaling 4 hours of data. | πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text, πŸ“‹ Tabular Data | [SaGA++](https://svito-zar.github.io/speech2properties2gestures/) | 638 | | ZEGGS | 67 monologue sequences with 19 different motion styles. Performed by a female actor speaking in English. Total duration: 134.65 minutes. | πŸŽ₯ Video, πŸ”Š Audio | [ZEGGS](https://github.com/ubisoft/ubisoft-laforge-ZeroEGGS) | 639 | | BEAT | 76 hours of 3D motion capture data from 30 speakers. Covers 8 emotions and 4 languages. Includes 32 million frame-level emotion and semantic relevance annotations. | πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text, πŸ“‹ Tabular Data | [BEAT](https://pantomatrix.github.io/BEAT/) | 640 | | BEAT2 | 60 hours of mesh-level, motion-captured co-speech gesture data. 
Integrates SMPL-X body and FLAME head parameters. Enhances modeling of head, neck, and finger movements. | πŸŽ₯ Video, πŸ”Š Audio, πŸ“‹ Tabular Data | [BEAT2](https://pantomatrix.github.io/EMAGE/) | 641 | | GAMT | 176 video clips of volunteers using math terms and gestures. Covers 8 classes of mathematical terms and gestures. | πŸŽ₯ Video, πŸ”Š Audio, πŸ“ Text | [GAMT](https://openaccess.thecvf.com/content/CVPR2024W/MAR/html/Maidment_Using_Language-Aligned_Gesture_Embeddings_for_Understanding_Gestures_Accompanying_Math_Terms_CVPRW_2024_paper.html) | 642 | | SeG | 208 types of global semantic gestures. 544 motion files recorded from a male performer. Each gesture is represented in 2.6 variations on average. | πŸŽ₯ Video, πŸ”Š Audio, πŸ“‹ Tabular Data | [SeG](https://pku-mocca.github.io/Semantic-Gesticulator-Page/) | 643 | | DND Group Gesture | 6 hours of gesture data from 5 individuals playing Dungeons & Dragons. Recorded over 4 sessions (total duration: 6 hours). Includes beat, iconic, deictic, and metaphoric gestures. | πŸŽ₯ Video, πŸ”Š Audio, πŸ“‹ Tabular Data | [DND](https://github.com/m-hamza-mughal/convofusion) | 644 | 645 | 646 | 647 | --- 648 | 649 | ### πŸ€– Models 650 | 651 | 652 | #### πŸ›  Traditional & Parametric Approaches 653 | 654 | - **Parameter-Based Procedural Animation** [πŸ”—](https://www4.nccu.edu.tw/~tyli/pdf/iva2009.pdf) 655 | *Uses high-level control parameters (e.g., emotion, speech intensity, rhythm) to select and interpolate predefined keyframes, yielding smooth and coherent gesture sequences.* 656 | 657 | - **Blendshape Models** [πŸ”—](https://graphics.cs.uh.edu/wp-content/papers/2014/2014-EG-blendshape_STAR.pdf) 658 | *Generates detailed hand and finger gestures by blending a set of predefined base shapes using weighted interpolation, enabling fine-grained control and smooth transitions.* 659 | 660 | 661 | #### 🧠 Deep Learning-Based Models 662 | 663 | - **GestureGAN** [πŸ”—](https://arxiv.org/abs/1808.04859) 664 | *Employs a GAN-based generator-discriminator framework to synthesize realistic gesture sequences conditioned on audio inputs, capturing dynamic hand gestures effectively.* 665 | 666 | - **Speech2Gesture** [πŸ”—](https://shiry.ttic.edu/projects/speech2gesture/) 667 | *Generates co-speech gestures directly from speech features using LSTM/RNN architectures, effectively modeling temporal dependencies between speech and gesture.* 668 | 669 | - **StyleGestures** [πŸ”—](https://diglib.eg.org/items/04569b85-1067-4ad0-9b32-46646ecdba66) 670 | *Utilizes an encoder-decoder architecture with style tokens and Transformers to capture individual speaker styles, enabling personalized gesture synthesis.* 671 | 672 | - **Audio-Driven Adversarial Gesture Generation** [πŸ”—](https://arxiv.org/abs/2303.09119) 673 | *Combines GANs with Conditional Variational Autoencoders (CVAE) to align audio and motion features in a shared latent space, resulting in nuanced, audio-driven gestures.* 674 | 675 | - **GestureDiffuCLIP** [πŸ”—](https://pku-mocca.github.io/GestureDiffuCLIP-Page/) 676 | *Leverages a diffusion process guided by CLIP for semantic alignment to iteratively refine gesture sequences, producing highly expressive gestures.* 677 | 678 | - **ZeroEGGS** [πŸ”—](https://arxiv.org/abs/2209.07556) 679 | *Implements a zero-shot paradigm for generating gestures based solely on speech, using example-based learning to generalize across unseen gestural styles.* 680 | 681 | - **GestureMaster** [πŸ”—](https://dl.acm.org/doi/10.1145/3536221.3558063) 682 | *Utilizes a graph neural 
network (GNN) framework to model spatial and temporal dependencies in gesture sequences, enhancing naturalistic hand and body gesture synthesis.* 683 | 684 | - **ExpressGesture** [πŸ”—](https://onlinelibrary.wiley.com/doi/10.1002/cav.2016) 685 | *Integrates emotion recognition with gesture generation pipelines, creating gestures that reflect both the content of speech and underlying sentiment.* 686 | 687 | - **MocapNET** [πŸ”—](https://github.com/FORTH-ModelBasedTracker/MocapNET) 688 | *Bridges traditional motion capture with neural synthesis by combining 2D pose estimation and 3D gesture reconstruction using multimodal motion capture datasets.* 689 | 690 | - **CSMP** [πŸ”—](https://camp-nerf.github.io/) 691 | *A diffusion-based co-speech gesture generation model that leverages joint text and audio representations to capture intricate inter-modal relationships.* 692 | 693 | - **ZS-MSTM** [πŸ”—](https://arxiv.org/abs/2305.12887) 694 | *Introduces a zero-shot style transfer method for gesture animation through adversarial disentanglement, separating style and content features for effective style transfer across speakers.* 695 | 696 | 697 | #### πŸš€ Transformer-Based Models 698 | 699 | - **Gesticulator** [πŸ”—](https://svito-zar.github.io/gesticulator/) 700 | *Employs a multimodal Transformer architecture to generate contextually relevant gestures conditioned on both text and audio inputs, aligning with co-speech dynamics.* 701 | 702 | - **Mix-StAGE** [πŸ”—](https://chahuja.com/mix-stage/) 703 | *Uses an attention-based encoder-decoder with a style encoder and mixed spatial-temporal attention mechanisms to capture dynamic, expressive gestures.* 704 | 705 | - **SAGA (Style and Grammar-Aware Gesture Generation)** [πŸ”—](https://arxiv.org/abs/2307.09597) 706 | *Combines an LSTM-based encoder-decoder with a Transformer-based grammar encoder to align gestures accurately with linguistic content, integrating both style and grammatical cues.* 707 | 708 | - **Cross-Modal Transformer** [πŸ”—](https://arxiv.org/abs/2301.01283) 709 | *Leverages cross-attention mechanisms to fuse diverse modalities (text, audio, video), enhancing the coherence and contextual alignment of generated gestures.* 710 | 711 | - **DiM-Gesture** [πŸ”—](https://arxiv.org/abs/2408.00370) 712 | *Introduces an adaptive layer normalization mechanism (Mamba-2) to adjust to different speakers, focusing on generating realistic co-speech gestures from audio.* 713 | 714 | - **AMUSE** [πŸ”—](https://amuse.is.tue.mpg.de/) 715 | *Utilizes a disentangled latent diffusion technique to separate emotional expressions from gestures, enabling control over emotional aspects via a multi-stage training pipeline.* 716 | 717 | - **FreeTalker** [πŸ”—](https://youngseng.github.io/FreeTalker/) 718 | *Employs a diffusion-based framework with classifier-free guidance and a generative prior (DoubleTake) to produce natural transitions between gesture clips, extending beyond co-speech gestures.* 719 | 720 | - **CoCoGesture** [πŸ”—](https://mattie-e.github.io/GES-X/) 721 | *Addresses long-sequence gesture generation with a Transformer-based diffusion model that uses a large dataset (GES-X) and a mixture-of-experts framework to effectively align gestures with human speech.* 722 | 723 | - **DiffuseStyleGestures** [πŸ”—](https://arxiv.org/abs/2305.04919) 724 | *Integrates audio, text, speaker IDs, and seed gestures within a diffusion-based approach to produce stylistically diverse co-speech gesture outputs.* 725 | 726 | - **DiffuseStyleGesture+** 
[πŸ”—](https://arxiv.org/abs/2308.13879) 727 | *Builds upon DiffuseStyleGestures by further refining gesture synthesis through advanced multimodal integration and specialized attention mechanisms for personalized outputs.* 728 | 729 | - **ViTPose** [πŸ”—](https://arxiv.org/abs/2204.12484) 730 | *Applies Vision Transformers to human pose estimation, providing a robust foundation for gesture synthesis by accurately capturing pose dynamics.* 731 | 732 | - **Gesture Motion Graphs** [πŸ”—](https://dl.acm.org/doi/10.1145/3577190.3616118) 733 | *Utilizes graph-based modeling for few-shot gesture reenactment, effectively representing motion sequences and their dependencies.* 734 | 735 | - **DiffSHEG** [πŸ”—](https://jeremycjm.github.io/proj/DiffSHEG/) 736 | *Adopts a diffusion-based approach for real-time speech-driven 3D expression and gesture generation, leveraging joint text and audio representations for coherent outputs.* 737 | 738 | - **C2G2** [πŸ”—](https://arxiv.org/abs/2308.15016) 739 | *Emphasizes controllability in co-speech gesture generation by using modular components to handle different aspects of gesture synthesis.* 740 | 741 | - **DiffuGesture** [πŸ”—](https://dl.acm.org/doi/10.1145/3610661.3616552) 742 | *Focuses on generating gestures for two-person dialogues with specialized diffusion techniques tailored for interactive and conversational settings.* 743 | 744 | --- 745 | 746 | ## πŸŽ₯ Motion 747 | Highlights text-constrained motion generation techniques, including MotionGPT and diffusion frameworks, for creating smooth and realistic animation sequences. 748 | 749 | ### πŸ—‚ Datasets 750 | 751 | 752 | | 🏷️ Name | πŸ“Š Statistics | πŸ” Modalities | πŸ”— Link | 753 | | ------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------- | ------------------------------------------------------------------------------------------------- | 754 | | Motion-X++ | 19.5 million 3D poses across 120,500 sequences, synchronized with 80,800 RGB videos and 45,300 audio tracks. Annotated with free-form text descriptions. | πŸ”· 3D/Point Cloud Data, πŸ“ Text, πŸ”Š Audio, πŸŽ₯ Video | [Motion-X++](https://github.com/IDEA-Research/Motion-X) | 755 | | HumanMM (ms-Motion) | 120 long-sequence 3D motions reconstructed from 500 in-the-wild multi-shot videos, totaling 60 hours of data. Includes rare interactions. | πŸ”· 3D/Point Cloud Data, πŸŽ₯ Video | [HumanMM](https://github.com/zhangyuhong01/HumanMM-code) | 756 | | Multimodal Anatomical Motion | 51,051 annotated poses with 53 anatomical landmarks, captured across 48 virtual camera views per pose. Includes 2,000+ pathological motion variations. | πŸ”· 3D/Point Cloud Data, πŸ“ Text | - | 757 | | AMASS | 11,265 motion clips aggregated from 15 mocap datasets (e.g., CMU, KIT), totaling 43 hours of motion data in SMPL format. Covers 100+ action categories. | πŸ”· 3D/Point Cloud Data | [AMASS](https://amass.is.tue.mpg.de/) | 758 | | HumanML3D | 14,616 motion sequences (28.6 hours) paired with 44,970 free-form text descriptions spanning 200+ action categories. | πŸ”· 3D/Point Cloud Data, πŸ“ Text | [HumanML3D](https://github.com/EricGuo5513/HumanML3D) | 759 | | BABEL | 43 hours of motion data from AMASS, annotated with 250+ verb-centric action classes across 13,220 sequences. 
Includes temporal action boundaries. | πŸ”· 3D/Point Cloud Data, πŸ“ Text | [BABEL](https://babel.is.tue.mpg.de/) | 760 | | AIST++ | 1,408 dance sequences (10.1 million frames) captured from 9 camera views, totaling 15 hours of multi-view RGB video data. | πŸ”· 3D/Point Cloud Data, πŸŽ₯ Video | [AIST++](https://google.github.io/aistplusplus_dataset/) | 761 | | 3DPW | 60 sequences (51,000 frames) captured in diverse indoor/outdoor environments, featuring challenging poses and natural object interactions. | πŸ”· 3D/Point Cloud Data, πŸŽ₯ Video | [3DPW](https://virtualhumans.mpi-inf.mpg.de/3DPW/) | 762 | | PROX | 20 subjects performing 12 interactive scenarios in 3D scenes, including 180 annotated RGB frames for scene-aware motion analysis. | πŸ”· 3D/Point Cloud Data, πŸ–ΌοΈ Images | [PROX](https://prox.is.tue.mpg.de/index.html) | 763 | | KIT-ML | 3,911 motion clips (11.23 hours) with 6,278 natural language annotations containing 52,903 words, stored in BVH/FBX formats. | πŸ”· 3D/Point Cloud Data, πŸ“ Text | [KIT-ML](https://git.h2t.iar.kit.edu/sw/motion-annotation) | 764 | 765 | 766 | 767 | --- 768 | 769 | ### πŸ€– Models 770 | 771 | 772 | #### πŸ”€ Language-to-Pose Models 773 | 774 | - **Language2Pose** [πŸ”—](https://chahuja.com/language2pose/) 775 | *Generates 3D human poses directly from natural language. It employs two encoders (for text and 3D motion) that map inputs into a joint embedding space, and a decoder that produces a fixed-length motion sequence by minimizing the distance between corresponding text and motion embeddings.* 776 | 777 | - **MotionClip** [πŸ”—](https://guytevet.github.io/motionclip-page/) 778 | *An end-to-end pipeline for motion generation based on an encoder-decoder transformer. The model extracts a high-level motion representation and uses multiple loss functionsβ€”comparing joint orientations, velocities, and image-text groundings via CLIPβ€”to enhance motion quality.* 779 | 780 | 781 | #### πŸ“¦ Variational Auto-Encoder (VAE) Based Models 782 | 783 | - **ACTOR** [πŸ”—](https://arxiv.org/abs/2104.05670) 784 | *Generates diverse and realistic 3D human motions conditioned on action labels. This approach uses a transformer-based VAE to encode actions and poses into a Gaussian latent space, allowing sampling of varied motions for the same action prompt.* 785 | 786 | - **TEMOS (Text-To-Motions)** [πŸ”—](https://arxiv.org/abs/2204.14109) 787 | *A text-conditioned generative model that uses a transformer-based VAE architecture with two symmetric encoders (for motion and text). It learns a diverse latent space by aligning text and pose embeddings to generate meaningful SMPL body motions.* 788 | 789 | - **Teach** [πŸ”—](https://arxiv.org/abs/2209.04066) 790 | *Transforms sequences of text descriptions into SMPL body motions. 
The model operates non-autoregressively within individual actions and autoregressively across action sequences by leveraging a past-conditioned text encoder that combines historical motion features with current text input.* 791 | 792 | - **Generating Diverse and Natural 3D Human Motions from Text** [πŸ”—](https://ericguo5513.github.io/text-to-motion/) 793 | *Generates 3D human motions from textual descriptions by first pre-training an auto-encoder (using convolutional and deconvolutional layers) and then utilizing a temporal VAE with three recurrent networks (prior, posterior, and generator) to produce motion snippets.* 794 | 795 | - **TMR** [πŸ”—](https://mathis.petrovich.fr/tmr/) 796 | *Enhances transformer-based text-to-motion generation by mapping motion and text embeddings into a joint space. Dual transformer encoders are used for each modality, and cosine similarity between embeddings is maximized for positive pairs while filtering out negatives via MPNet similarity.* 797 | 798 | 799 | #### πŸ— VQ-VAE Based Models 800 | 801 | - **T2M GPT** [πŸ”—](https://mael-zys.github.io/T2M-GPT/) 802 | *The first model to apply VQ-VAE for motion generation. It learns a discrete representation (codebook) of motion and formulates motion generation as an autoregressive token prediction task, conditioned on text encoded by CLIP.* 803 | 804 | - **DiverseMotion** [πŸ”—](https://arxiv.org/abs/2309.01372) 805 | *Builds upon T2M GPT by discarding the autoregressive generation in favor of a diffusion process to diversify motion outputs. It employs CLIP for text encoding and Hierarchy Semantic Aggregation (HSA) to generate a richer holistic text embedding.* 806 | 807 | - **MoMask** [πŸ”—](https://ericguo5513.github.io/momask/) 808 | *Uses a hierarchical VQ-VAE to quantize motion sequences into discrete tokens over multiple layers. A masked transformer predicts missing tokens (similar to BERT), and a residual transformer refines these predictions to incorporate fine motion details.* 809 | 810 | - **T2LM Long-Term 3D Human Motion** [πŸ”—](https://arxiv.org/abs/2406.00636) 811 | *Transforms sequences of text descriptions into 3D motion sequences using a 1D-convolutional VQ-VAE and a transformer-based text encoder. This method generates smooth transitions between actions, outperforming earlier techniques like TEACH.* 812 | 813 | - **MotionGPT** [πŸ”—](https://motion-gpt.github.io/) 814 | *Generates human motion from text by leveraging a pre-trained motion VQ-VAE alongside a large language model (LLM). The LLM is fine-tuned with LoRA to generate motion tokens that the VQ-VAE decoder transforms into motion sequences, significantly speeding up training.* 815 | 816 | 817 | #### 🌈 Diffusion-Based Models 818 | 819 | - **Flame** [πŸ”—](https://kakaobrain.github.io/flame/) 820 | *Employs a transformer-based motion decoder within a diffusion framework. It conditions on text using cross-attention (with text embeddings from RoBERTA) and incorporates special tokens for motion length and diffusion time steps. The model is optimized with a hybrid loss combining diffusion noise loss and a variational lower bound loss, with classifier-free guidance during inference.* 821 | 822 | - **MotionDiffuse** [πŸ”—](https://mingyuan-zhang.github.io/projects/MotionDiffuse.html) 823 | *Similar to Flame but with slight architectural variations: it selects a random diffusion time step and divides the motion sequence into sub-intervals for time-varying conditioning. 
It utilizes efficient attention modules and optimizes using mean squared error on the noise prediction.* 824 | 825 | - **HMDM** [πŸ”—](https://www.researchgate.net/publication/389662135_IT-HMDM_Invertible_Transformer_for_Human_Motion_Diffusion_Model) 826 | *A diffusion-based model with a fixed motion sequence length that leverages CLIP’s text encoder. It introduces additional loss functions (e.g., position, foot, and velocity losses) defined on the reconstructed motion signal, rather than just the noise, to improve temporal consistency and motion fidelity.* 827 | 828 | - **Make-An-Animation** [πŸ”—](https://azadis.github.io/make-an-animation/) 829 | *Proposes a two-stage diffusion framework for text-to-3D motion generation. The model pre-trains on a large-scale static pose dataset using a UNet backbone and T5 text encoder, then fine-tunes on motion datasets, generating the entire motion sequence concurrently for improved smoothness.* 830 | 831 | - **GMD (Guided Motion Diffusion)** [πŸ”—](https://korrawe.github.io/gmd-project/) 832 | *Focuses on incorporating spatial (trajectory) constraints into the diffusion process. The method uses a two-stage pipeline that first emphasizes ground location guidance and then propagates sparse guidance gradients across neighboring frames to enhance overall motion consistency.* 833 | 834 | - **OmniControl** [πŸ”—](https://neu-vi.github.io/omnicontrol/) 835 | *Extends spatial guidance by cumulatively summing relative pelvis locations to infer global positions. It also introduces realism guidance, propagating control signals from keyframes and the pelvis to other joints for coherent, natural motion generation.* 836 | 837 | --- 838 | 839 | ## πŸ“¦ Object 840 | Discusses approaches for text-to-3D object generation, such as Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting, to create realistic assets. 841 | 842 | ### πŸ—‚ Datasets 843 | 844 | 845 | | 🏷️ Name | πŸ“Š Statistics | πŸ” Modalities | πŸ”— Link | 846 | | -------------- | -------------------------------------------------------------------------------------- | -------------------------------------------------- | --------------------------------------------------------------------------------------- | 847 | | ShapeNet | 3D models in categories like furniture and vehicles. | πŸ”· 3D/Point Cloud Data, πŸ“ Text | [ShapeNet](http://shapenet.org/about) | 848 | | BuildingNet | Architectural structures for shape completion tasks. | πŸ”· 3D/Point Cloud Data, πŸ“ Text | [BuildingNet](https://github.com/BuildingNet/BuildingNet) | 849 | | Text2Shape | Textual descriptions linked to ShapeNet categories. | πŸ“ Text, πŸ”· 3D/Point Cloud Data | [Text2Shape](https://text2shape.github.io/) | 850 | | ShapeGlot | Textual utterances describing differences between shapes. | πŸ“ Text, πŸ”· 3D/Point Cloud Data | [ShapeGlot](https://github.com/alters-mit/ShapeGlot) | 851 | | Pix3D | 3D models aligned with real-world images for evaluation. | πŸ–ΌοΈ Images, πŸ”· 3D/Point Cloud Data | [Pix3D](https://github.com/xingyuansun/pix3d) | 852 | | LAION-5B | Large-scale dataset with 5 billion image-text pairs. | πŸ–ΌοΈ Images, πŸ“ Text | [LAION-5B](https://laion.ai/) | 853 | | COCO-Stuff | Annotated images for real-world 3D synthesis. | πŸ–ΌοΈ Images, πŸ“ Text | [COCO-Stuff](https://github.com/nightrome/cocostuff) | 854 | | Flickr30K | Image dataset with diverse textual descriptions. 
| πŸ–ΌοΈ Images, πŸ“ Text | [Flickr30K](https://github.com/ubmdmg/Flickr30kEntities) | 855 | | ModelNet40 | 3D CAD models across 40 object categories. | πŸ”· 3D/Point Cloud Data | [ModelNet40](https://modelnet.cs.princeton.edu/) | 856 | | ShapeNetCore | Subset of ShapeNet with detailed object models. | πŸ”· 3D/Point Cloud Data, πŸ“ Text | [ShapeNetCore](http://shapenet.org/about) | 857 | | BlendSwap | Realistic 3D models with physically based rendering (PBR). | πŸ”· 3D/Point Cloud Data, πŸ–ΌοΈ Images | [BlendSwap](https://www.blendswap.com/) | 858 | | InstructPix2Pix| Dataset for instruction-driven image modifications. | πŸ–ΌοΈ Images, πŸ“ Text | [InstructPix2Pix](https://github.com/timothybrooks/instruct-pix2pix) | 859 | | MagicBrush | Dataset for refining texture and appearance in 3D. | πŸ–ΌοΈ Images | [MagicBrush](https://github.com/OSU-NLP-Group/MagicBrush) | 860 | | NeRF-Synthetic | 2D images rendered from synthetic 3D scenes. | πŸ–ΌοΈ Images | [NeRF-Synthetic](https://github.com/bmild/nerf) | 861 | | ScanNet | 2.5M RGB-D views with semantic segmentations and camera poses. | πŸ–ΌοΈ Images, πŸ“ Text | [ScanNet](http://www.scan-net.org/) | 862 | | Matterport3D | 10,800 panoramic views from 90 building-scale scenes. | πŸ–ΌοΈ Images, πŸ”· 3D/Point Cloud Data, πŸ“ Text | [Matterport3D](https://github.com/niessner/Matterport) | 863 | 864 | 865 | --- 866 | 867 | ### πŸ€– Models 868 | 869 | --- 870 | 871 | ## 🧡 Texture 872 | 873 | Focuses on methods for generating detailed surface textures that enhance the realism of 3D models, including text-guided synthesis and neural rendering techniques. 874 | 875 | ### πŸ—‚ Datasets 876 | 877 | | 🏷️ Name | πŸ“Š Statistics | πŸ” Modalities | πŸ”— Link | 878 | |-----------------------|-------------------------------------------------------------------------------------------------------------|-----------------------------------------------|-----------------------------------------------------------------------------------------------| 879 | | **3D-FUTURE** | 9,992 detailed 3D furniture models with high-res textures; 20,240 synthetic images across 5,000 scenes. | πŸ”· 3D Geometry, πŸ–ΌοΈ Texture, πŸ–ΌοΈ 2D Images | [3D-FUTURE](https://tianchi.aliyun.com/specials/promotion/alibaba-3d-future) | 880 | | **Objaverse** | Over 800 K textured 3D models with natural-language descriptions across diverse categories. | πŸ”· 3D Geometry, πŸ–ΌοΈ Texture, πŸ“ Language | [Objaverse](https://objaverse.allenai.org/) | 881 | | **ShapeNet** | Large-scale structured 3D meshes (incl. 300 car models) used for texture benchmarking. | πŸ”· 3D Geometry, πŸ–ΌοΈ Texture | [ShapeNet](https://shapenet.org/) | 882 | | **ShapeNetSem** | Semantic extension of ShapeNet with 445 annotated meshes for structure-aware evaluation. | πŸ”· 3D Geometry, πŸ–ΌοΈ Texture | [ShapeNetSem](https://shapenet.org/) | 883 | | **ModelNet40** | 40-category CAD benchmark for generalization testing in geometry-aware texture generation. | πŸ”· 3D Geometry | [ModelNet40](https://modelnet.cs.princeton.edu/) | 884 | | **Sketchfab** | Repository of commercial and scanned 3D models for qualitative texture evaluation. | πŸ”· 3D Geometry, πŸ–ΌοΈ Texture | [Sketchfab](https://sketchfab.com/) | 885 | | **CGTrader** | High-res 3D assets for mesh diversity in text-driven synthesis. | πŸ”· 3D Geometry, πŸ–ΌοΈ Texture | [CGTrader](https://www.cgtrader.com/) | 886 | | **TurboSquid** | Commercial dataset of detailed assets and fine-surface textures for high-fidelity evaluations. 
| πŸ”· 3D Geometry, πŸ–ΌοΈ Texture | [TurboSquid](https://www.turbosquid.com/) | 887 | | **RenderPeople** | High-quality human scans with detailed anatomy and surface properties for text-to-texture testing. | πŸ”· 3D Scans | [RenderPeople](https://renderpeople.com/) | 888 | | **Triplegangers** | Scanned high-fidelity human models for evaluating facial and clothing texture realism. | πŸ”· 3D Scans | [Triplegangers](https://triplegangers.com/) | 889 | | **Stanford 3D Scans** | High-resolution object scans for generalization tests on real-world geometries. | πŸ”· 3D Scans | [Stanford 3D Scans](http://graphics.stanford.edu/data/3Dscanrep/) | 890 | | **ElBa** | 30 K synthetic texture images with 3 M texel-level annotations for element-based analysis. | πŸ–ΌοΈ 2D Texture, πŸ“ Attributes & Layout | [ElBa](https://github.com/godimarcovr/Texel-Att) | 891 | 892 | --- 893 | 894 | ### πŸ€– Models 895 | 896 | - **CLIP-Pseudo Inpainting** πŸ”— 897 | Pioneering masked-inpainting pipeline using CLIP pseudo-captioning to semantically align 2D renderings with 3D geometry without paired text data. 898 | [arXiv:2303.13273](https://arxiv.org/abs/2303.13273) 899 | 900 | - **Text2Tex** πŸ”— 901 | Two-stage diffusion: Stage I generates initial textures via depth-to-image denoising; Stage II back-projects and refines them in UV space by selecting extra views to correct artifacts. 902 | [arXiv:2303.11396](https://arxiv.org/abs/2303.11396) 903 | 904 | - **TEXTure** πŸ”— 905 | Inpainting-based diffusion with trimap-based surface segmentation to generate, refine, or preserve regions, ensuring smooth transitions and efficient passes. 906 | [Project page](https://texturepaper.github.io/TEXTurePaper/) 907 | 908 | - **Paint-it** πŸ”— 909 | Integrates PBR rendering and U-Net reparameterization with CLIP-guided Score Distillation Sampling for high-fidelity mesh texturing, at the cost of per-model optimization time. 910 | [arXiv:2312.11360](https://arxiv.org/abs/2312.11360) 911 | 912 | - **Point-UV Diffusion** πŸ”— 913 | Coarse-to-fine pipeline: initial mesh-surface painting then 2D UV diffusion refinement, decoupling global structure generation from fine-detail synthesis. 914 | [ICCV 2023 paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Yu_Texture_Generation_on_3D_Meshes_with_Point-UV_Diffusion_ICCV_2023_paper.pdf) 915 | 916 | - **TexPainter** πŸ”— 917 | Latent diffusion in color-space embeddings using depth-conditioned DDIM sampling across fixed viewpoints, aggregated into a unified texture map. 918 | [arXiv:2406.18539](https://arxiv.org/abs/2406.18539) 919 | 920 | - **TexFusion** πŸ”— 921 | Sequential Interlaced Multiview Sampler fuses multi-view latent features during diffusion, reducing inference time while preserving cross-view coherence. 922 | [arXiv:2310.13772](https://arxiv.org/abs/2310.13772) 923 | 924 | - **GenesisTex** πŸ”— 925 | Cross-view attention during diffusion followed by Img2Img post-processing to eliminate seams and enhance surface detail in UV maps. 926 | [arXiv:2403.17782](https://arxiv.org/abs/2403.17782) 927 | 928 | - **Consistency²** πŸ”— 929 | Latent Consistency Models that achieve fast, multi-view coherent textures with just four denoising steps, disentangling noise and color paths. 930 | [arXiv:2406.11202](https://arxiv.org/abs/2406.11202) 931 | 932 | - **Meta 3D TextureGen** πŸ”— 933 | Two-stage: geometry-aware diffusion produces multi-view images; incidence-aware UV inpainting and patch upscaling yield seamless 4K textures.
934 | [Meta Research](https://ai.meta.com/research/publications/meta-3d-texturegen-fast-and-consistent-texture-generation-for-3d-objects/) 935 | 936 | - **VCD-Texture** πŸ”— 937 | Variance Alignment with joint noise prediction and multi-view aggregation modules to maintain statistical feature consistency across views. 938 | [arXiv:2407.04461](https://arxiv.org/abs/2407.04461) 939 | 940 | 941 | 942 | 943 | 946 | 947 | 948 | 949 | 1056 | 1057 | 1058 | --- 1059 | ## πŸ”— Citations 1060 | If you find our paper or repository useful, please cite the paper: 1061 | ``` 1062 | @article{abootorabi2025generative, 1063 | title={Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions}, 1064 | author={Abootorabi, Mohammad Mahdi and Ghahroodi, Omid and Zahraei, Pardis Sadat and Behzadasl, Hossein and Mirrokni, Alireza and Salimipanah, Mobina and Rasouli, Arash and Behzadipour, Bahar and Azarnoush, Sara and Maleki, Benyamin and others}, 1065 | journal={arXiv preprint arXiv:2504.19056}, 1066 | year={2025} 1067 | } 1068 | ``` 1069 | ## πŸ“§ Contact 1070 | If you have questions, please send an email to mahdi.abootorabi2@gmail.com. 1071 | 1072 | ## ⭐ Star History 1073 | [![Star History Chart](https://api.star-history.com/svg?repos=llm-lab-org/Generative-AI-for-Character-Animation-Survey&type=Date)](https://www.star-history.com/#llm-lab-org/Generative-AI-for-Character-Animation-Survey&Date) 1074 | --------------------------------------------------------------------------------