├── LICENSE
├── README.md
├── README_en.md
└── assets
    ├── image-1.png
    ├── image-2.png
    ├── image-3.png
    ├── image-4.png
    ├── image-5.png
    ├── image-6.png
    ├── image-7.png
    ├── image-8.png
    ├── image.png
    ├── image10.png
    └── image9.png

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 yuanzhuo

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# 数字人主要技术整理

**中文** | [**English**](./README_en.md)

目前数字人主要包括形象、声音和对话能力几方面,主要交互方式为直接与数字人进行对话。以下从多方面进行了收集和总结,以期提供快速入门帮助。

---

***更新数字人图示,以直观呈现输入输出流程中涉及的各种技术和代表性解决方案***

![未命名文件](https://github.com/user-attachments/assets/2b60395f-dcfb-4703-bdbb-bd98586a5e80)

公开分享链接如下,欢迎修改完善:https://www.processon.com/embed/60051bca7d9c084cf9ec5dad?cid=60051bca7d9c084cf9ec5dae

---

## Demo Project

### 数字人学术汇报

通过非常少的原始素材,生成高质量的学术汇报、产品汇报数字人视频。需要素材:(1)一张真人照片,(2)一段此人约10秒的任意语种音频,(3)原始PPT,即可生成一段数字人学术汇报视频。

https://github.com/user-attachments/assets/ad846bff-18ac-4bc0-b964-b6c668db6968

https://github.com/user-attachments/assets/1aadcc4f-46b4-4097-aeb0-03307b83da6f

#### 1. 文本准备:

使用GPT-4o或其他视觉大模型,定制Prompt(如:以××身份帮我生成演讲逐字稿,语气轻松;我会逐页上传,注意每一页前后衔接等),逐页上传PPT,获取演讲稿。Prompt需要不断优化以取得最好效果,逐页调用的示意见下方代码。
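以下是逐页生成逐字稿的最小示意,假设PPT已逐页导出为PNG,并使用OpenAI兼容接口(模型名、提示词与文件路径均为示例,并非本项目的固定实现):

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # 从环境变量读取 OPENAI_API_KEY,兼容接口可另传 base_url

def page_to_script(image_path: str, page_no: int, prev_tail: str) -> str:
    """上传单页PPT截图,返回这一页的演讲逐字稿。"""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "你是××领域的学术报告人,请为每页PPT生成口语化的演讲逐字稿,语气轻松,并与上一页自然衔接。"},
            {"role": "user", "content": [
                {"type": "text", "text": f"这是第{page_no}页,上一页的结尾是:{prev_tail}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ]},
        ],
    )
    return resp.choices[0].message.content

# 实际使用时按页循环调用,并把上一页结尾传入,以保证前后衔接
```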
#### 2. 声音克隆:

使用声音克隆引擎进行克隆,开源方案CosyVoice(80分),闭源方案HeyGen(目前采用的方案,90分)。

#### 3. 照片驱动数字人原始视频

##### 3.1 使用阿里云PAI ArtLab生成类卡通数字人形象

项目介绍:https://mp.weixin.qq.com/s/DaP9rvW6A9jx1GoLyU0zHQ
直达链接:https://x.sm.cn/GEGDfU9

这种方法的优点是生成的数字人在保证真实感的同时又带一些卡通风格,可以显著降低恐怖谷效应(这是所有观看者的一致反馈)。
![demo2](https://github.com/user-attachments/assets/d05a75a0-41cc-4de6-b57c-a63022367260)

##### 3.2 照片驱动

开源方案:50-70分,闭源方案HeyGen(目前采用的方案,90分)。

#### 4. 后期合成

数字人素材进行抠像后与PPT逐页合成,合成示意见下方代码。
优化:如果发现抠像局部不完善,可通过PS等软件将PNG照片素材的相应部分填充白色背景尝试解决。
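一个基于FFmpeg的单页合成示意,假设数字人原始视频为纯绿幕背景(颜色值与抠像阈值需按素材微调,路径均为示例):

```python
import subprocess

def compose_page(slide_png: str, avatar_mp4: str, out_mp4: str) -> None:
    """将绿幕数字人视频抠像后叠加到单页PPT图片的右下角。"""
    filter_complex = (
        "[1:v]chromakey=0x00FF00:0.15:0.05[fg];"     # 抠除绿幕,阈值按素材调整
        "[0:v][fg]overlay=W-w-40:H-h-40:shortest=1"  # 叠加到右下角,以人像时长为准
    )
    subprocess.run(
        ["ffmpeg", "-y",
         "-loop", "1", "-i", slide_png,  # 将静态PPT页循环为视频背景
         "-i", avatar_mp4,               # 数字人绿幕视频(含配音)
         "-filter_complex", filter_complex,
         out_mp4],
        check=True,
    )

compose_page("slides/page01.png", "avatar/page01.mp4", "out/page01.mp4")
```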
#### 5. 多语言支持

将PPT通过翻译狗(fanyigou.com)等软件进行跨语言翻译,经测试效果较好;再将逐字稿翻译成对应语言,把声音克隆为对应语音并进行合成。

---

## 0. 实时感知交互能力

### 0.1 GPT-4o

随着[GPT-4o](https://openai.com/index/hello-gpt-4o/)一系列演示视频的发布,实时性的问题几乎得到解决。通过**实时对话、打断、主动提问,以及实时分析摄像头内容**,结合本地知识库、Agent等能力,数字人一下子达到了更高级别的可用性。

**无需实体形象的可用场景(可穿戴设备:实时采集、云端处理、语音及图像反馈):**

- 个人实时助手
- 盲人助手
- 翻译助手
- 学生学习辅导
- 其他(欢迎提交补充)

**需要实体形象的可用场景:**

- 数字人赋能,但是目前还未有技术能解决数字人的互动能力,比如实时往嘴上涂口红、自由镜头下的多角度运动等
- 实体机器人赋能,如救援机器人的自主决策、与控制人员通过自然语言或特定语法进行交流等
- 其他

目前OpenAI还**暂未提供演示中涉及的声音和视频的API**,只提供了GPT-4o的文字对话和图片识别能力,相较于之前的GPT-4-Vision-Preview等区别不大。

相应演示视频:

### 0.2 其他实现

tbd

## 1. 形象驱动

### 1.1 真人录制+算法驱动

真人出镜录制素材视频,后期通过AI驱动口型和姿态等方式实现数字人。

- 优点:难辨真假(因为是直接录制的真人素材),口型对得准,可实时直播也可录播。
- 缺点:贵(可能)

> 本图片中右侧为数字人,左侧为真人
![数字人1](assets/image.png)
![数字人2](assets/image-1.png)

相应演示视频:

---

**相关技术:**

- 唇形同步Lip Sync技术(代表:[Wav2Lip](https://github.com/Rudrabha/Wav2Lip)、[HeyGen](https://www.heygen.com/)、[rask.ai](https://rask.ai/)),Wav2Lip的调用示意见本列表下方
- 实时视频换脸(代表:[DeepFakeLive](https://www.deepfakevfx.com/downloads/deepfacelive/)、[FaceFusion](https://github.com/facefusion/facefusion)、[fal.ai](https://fal.ai/models/fal-ai/fast-turbo-diffusion/playground))
- 图片转视频(代表:[MuseTalk](https://github.com/TMElyralab/MuseTalk)、[Sadtalker](https://github.com/OpenTalker/SadTalker))
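Wav2Lip的命令行调用示意,假设已克隆仓库并按其README下载权重(参数来自项目文档,路径仅为示例):

```python
import subprocess

# 用克隆好的语音驱动原始人物视频的口型
subprocess.run(
    ["python", "inference.py",
     "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # 预训练权重
     "--face", "input/presenter.mp4",                     # 原始人物视频
     "--audio", "input/speech.wav"],                      # 克隆好的语音
    cwd="Wav2Lip",  # 仓库根目录
    check=True,
)
# 结果默认输出到 results/result_voice.mp4
```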
### 1.2 建模+算法驱动

建模有更高的自由度,高精度、低精度等各种建模方式丰俭由人,也可以另辟蹊径打造卡通形象等。

**代表技术:**

[Meta Human](https://www.unrealengine.com/en-US/metahuman)
![Meta Human](assets/image-5.png)

[NVIDIA Omniverse Audio2Face](https://www.nvidia.com/en-us/omniverse/apps/audio2face/)
![Audio2Face](assets/image-4.png)

[Live2D](https://www.live2d.com/en/)
![Live2D](assets/image-2.png)

[Adobe Character Animator](https://www.adobe.com/hk_en/products/character-animator.html)
![Adobe Character Animator](assets/image-3.png)

## 2. 声音模仿

**一些非专业的背景知识补充:**
数字人声音可使用现有模型的TTS,或使用自训练的声音模型。声学模型是声音合成系统的重要组成部分。
![声学模型](https://i0.hdslb.com/bfs/article/439a654b5efa2b623d5e6cbd68ac525665ad737b.png@1256w_240h_!web-article-pic.avif)

主流声学模型包括[VITS](https://github.com/jaywalnut310/vits)、[Tacotron](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2)、[FastSpeech2](https://github.com/ming024/FastSpeech2)等。VITS(Variational Inference with adversarial learning for end-to-end Text-to-Speech)是一种端到端的语音合成方法,将声学建模与声码器(vocoder)以对抗学习方式联合训练,直接从文本生成语音波形。
![vits process](https://i0.hdslb.com/bfs/article/6fb3acf043b2842d861066653a85fff84be95af7.png@1256w_726h_!web-article-pic.avif)

之前流行的AI孙燕姿等,采用技术为[so-vits-svc](https://github.com/svc-develop-team/so-vits-svc/tree/4.1-Stable),全称SoftVC VITS Singing Voice Conversion。该技术由声音爱好者基于[softVC](https://github.com/bshall/soft-vc)和[VITS](https://github.com/jaywalnut310/vits)修改而来。

### 声音模仿相关热点项目(截至2024.6)

**1. GPT-SoVITS** *27.0K stars*
声音模型训练项目,少量语音数据即可微调模型,提供WebUI。

**项目地址:** <https://github.com/RVC-Boss/GPT-SoVITS>

**演示视频:**

**2. so-vits-svc** *24.4K stars*
声音模型训练项目,代表:AI孙燕姿。

**项目地址:** <https://github.com/svc-develop-team/so-vits-svc>

**演示视频:**

**3. ChatTTS** *23.3K stars*
并非声音克隆项目,但其文字转语音效果非常好,有停顿、有语气、有情绪,原生支持中文。网上提供了Windows、Linux等平台的各种一键部署包、懒人包。

**项目地址:** <https://github.com/2noise/ChatTTS>

**演示视频:**

**其他:**[剪映capcut声音克隆](https://www.capcut.cn/)、[睿声Reecho](https://www.reecho.ai/)、[Emotional VITS](https://github.com/innnky/emotional-vits)、[Bark](https://github.com/suno-ai/bark)

## 3. 互动技术

**tbd**
*如多镜头多角度下的数字人、实时换装、化妆等。*

## 4. 应用场景及综合代表项目

数字人在自媒体(知识科普等相关口播博主)、电商直播带货、教育教学领域有所应用,在数字生命(已故亲人)等领域(和AR、VR等结合)也有探索。此外,数字人技术和实体机器人的融合等也是题中应有之义。

**代表项目:**

1. [AI-Vtuber](https://github.com/Ikaros-521/AI-Vtuber)
   【开源】AI Vtuber是一个由大模型驱动的、融合外观与声音的虚拟AI主播

2. [Fay](https://github.com/TheRamU/Fay)
   【开源】Fay是一个完整的开源项目,包含Fay控制器及数字人模型,可灵活组合出不同的应用场景:虚拟主播、现场推销、商品导购、语音助理、远程语音助理、数字人互动、数字人面试官及心理测评、贾维斯、Her。

3. [HeyGen](https://www.heygen.com/)
   【海外/华人创办】AI视频制作热门平台,提供数字分身、声音克隆等多种相关功能。
   ![HeyGen](assets/image-7.png)

4. [特看科技](https://www.zhubobao.com/)
   【国产商用】基于真人视频的高质量数字人
   ![特看科技](assets/image-6.png)

5. [腾讯智影](https://zenvideo.qq.com/)
   【国产商用】融合多种AIGC能力的综合创作平台。
   ![腾讯智影](assets/image-8.png)

6. [超能科智](https://mp.weixin.qq.com/s/etcD4SEMznBctOjuNJty2A)
   【国产商用】AIGC课程内容生产代表,提供内容生产和服务的一站式平台
   ![超能科智](assets/image9.png)

7. [飞影数字人](https://hifly.cc)
   【国产商用】提供数字分身、声音克隆等多种功能
   ![飞影数字人](assets/image10.png)

## 5. 法律法规、代表性新闻

- **[《互联网信息服务深度合成管理规定》](https://www.gov.cn/zhengce/zhengceku/2022-12/12/content_5731431.htm)**
  *深度合成服务提供者和技术支持者提供智能对话、合成人声、人脸生成、沉浸式拟真场景等生成或者显著改变信息内容功能的服务的,应当进行显著标识,避免公众混淆或者误认*

- **[《北京市促进数字人产业创新发展行动计划》](https://www.beijing.gov.cn/zhengce/zhengcefagui/202208/W020220808406785112297.pdf)**
  *北京市经济和信息化局发布国内首个数字人产业专项支持政策——《北京市促进数字人产业创新发展行动计划(2022—2025年)》*
- **[大模型、数字人技术在教育领域中如何得以应用?](https://learning.sohu.com/a/713671752_120619005)**
  *“用科技促进教育发展,让更多人受益,是我们的初心。构建更有效果、更有效率、更有体验感的教育,让全球的学习者都能享有优质数字教育资源。”
  8月20日,在2023全球智慧教育大会现场,北京师范大学智慧学习研究院副院长、网龙副总裁陈长杰在接受媒体采访时做了上述表示。*

- **[全球未来教育设计大赛](https://gcd4fe.bnu.edu.cn/)** 项目实际体验使用说明文档(AIGC夏令营&GCD4FE 48H)

## 6. 数字人的大脑 Large Language Model

### 目前支持图片识别和处理的多模态模型主要有

gpt-4o、gpt-4-vision-preview、gemini-pro-vision、智谱GLM-4V、零一万物yi-vl-plus、通义千问Qwen-VL-Max、LLaVA(开源)等。

### 各模型API申请地址

- baidu

- 360

- qwen

- xinghuo

- zhipu

- moonshot ai

- hunyuan

- baichuan

- minimax

- 零一

- 阶跃星辰

- claude

- gemini

### 开源大模型集成前端

- **Open-WebUI** 提供丰富功能的WebUI,可集成各类大模型,具有用户组功能,管理员可便捷管理多用户,并收集用户详细使用数据。
  <https://github.com/open-webui/open-webui>

- **Jan** 提供多平台客户端,集成各种开源和API模型,UI美观简单易用。
  <https://github.com/janhq/jan>

- **Langchain-Chatchat** 通过WebUI提供多模型支持和RAG、agent等功能。
  <https://github.com/chatchat-space/Langchain-Chatchat>

### 大模型API集成管理网关

- Ollama
  <https://github.com/ollama/ollama>

- LiteLLM(统一调用示意见本列表下方)
  <https://github.com/BerriAI/litellm>

- vLLM
  <https://github.com/vllm-project/vllm>

- OneAPI
  <https://github.com/songquanpeng/one-api>

- gateway
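以LiteLLM为例,用OpenAI格式统一调用不同厂商模型的最小示意(模型名与环境变量按各家文档配置,此处仅为示例):

```python
# pip install litellm
from litellm import completion

# LiteLLM 将各家接口统一成 OpenAI 格式,切换模型只需改 model 字符串
resp = completion(
    model="gpt-4o",  # 也可以是其他厂商的模型标识,详见 LiteLLM 文档
    messages=[{"role": "user", "content": "用一句话介绍数字人"}],
)
print(resp.choices[0].message.content)
```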
### 本地知识库和智能体构建

- FastGPT
  <https://github.com/labring/FastGPT>

- dify
  <https://github.com/langgenius/dify>

- Coze
  <https://www.coze.cn>

### 大模型自动化测评工具

- OpenCompass
  <https://github.com/open-compass/opencompass>

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=YUANZHUO-BNU/metahuman_overview&type=Date)](https://star-history.com/#YUANZHUO-BNU/metahuman_overview&Date)

--------------------------------------------------------------------------------
/README_en.md:
--------------------------------------------------------------------------------
# Summary of Major Technologies in Digital Humans

[**中文**](./README.md) | **English**

Currently, digital humans mainly encompass aspects of appearance, voice, and conversational abilities. The primary mode of interaction is direct conversation with the digital human. The following has been collected and summarized from various sources to provide quick-start assistance.

---

***Updated the digital human diagram to visually present the technologies and representative solutions involved in the input/output pipeline.***

![Untitled File](https://github.com/user-attachments/assets/2b60395f-dcfb-4703-bdbb-bd98586a5e80)

The public sharing link is as follows. Feel free to modify and improve: https://www.processon.com/embed/60051bca7d9c084cf9ec5dad?cid=60051bca7d9c084cf9ec5dae

---

## Demo Project

### Digital Human Academic Presentation

Generate high-quality academic and product presentation videos from minimal raw materials. Required materials: (1) a photo of a real person, (2) a 10-second audio clip of the person in any language, and (3) the original PPT; these are enough to generate a digital human academic presentation.

https://github.com/user-attachments/assets/ad846bff-18ac-4bc0-b964-b6c668db6968

https://github.com/user-attachments/assets/1aadcc4f-46b4-4097-aeb0-03307b83da6f

#### 1. Text Preparation:

Use GPT-4o or another vision-capable large model with a customized prompt (e.g., "Help me generate a verbatim speech script as xx, with a relaxed tone. I will upload the slides page by page; please keep consecutive pages coherent."). Upload the PPT page by page to get the speech script. (Continuous prompt optimization is required for the best results.)

#### 2. Voice Cloning:

Clone the voice with a voice-cloning engine. Open-source option: CosyVoice (80/100); closed-source option: HeyGen (the solution currently used, 90/100). A zero-shot cloning sketch follows.
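A rough zero-shot cloning sketch with [CosyVoice](https://github.com/FunAudioLLM/CosyVoice). The API below follows the project README at the time of writing and may have changed; the model path and file names are illustrative:

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav

# Pretrained weights downloaded per the CosyVoice README
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M')

# The 10-second reference recording of the speaker, resampled to 16 kHz
prompt_speech = load_wav('speaker_10s.wav', 16000)

for i, out in enumerate(cosyvoice.inference_zero_shot(
        'Text to be spoken in the cloned voice.',  # target text
        'Transcript of the 10-second sample.',     # transcript of the reference
        prompt_speech)):
    torchaudio.save(f'clone_{i}.wav', out['tts_speech'], 22050)  # 300M models output 22.05 kHz
```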
#### 3. Photo-Driven Digital Human Raw Video

##### 3.1 Use Alibaba Cloud PAI ArtLab to generate a cartoon-like digital human image

Project introduction: https://mp.weixin.qq.com/s/DaP9rvW6A9jx1GoLyU0zHQ
Direct link: https://x.sm.cn/GEGDfU9

The advantage of this method is that the generated digital human remains realistic while carrying a slightly cartoonish style, which significantly reduces the uncanny valley effect (consistent feedback from all viewers).
![Demo 2](https://github.com/user-attachments/assets/d05a75a0-41cc-4de6-b57c-a63022367260)

##### 3.2 Photo-Driven

Open-source solutions: 50-70/100; closed-source solution: HeyGen (the solution currently used, 90/100).

#### 4. Post-Production Composition

After keying out the digital human, composite it with each page of the PPT.
Optimization: if imperfections are noticed during keying, try filling the affected parts of the PNG photo material with a white background using software like Photoshop.

#### 5. Multi-language Support

Use translation software like FanyiGou (fanyigou.com) to translate the PPT across languages; tests show good results. Translate the verbatim script into the corresponding language, clone the voice into that language, and composite accordingly.

---

## 0. Real-Time Perceptual Interaction Abilities

### 0.1 GPT-4o

With the release of a series of demonstration videos for [GPT-4o](https://openai.com/index/hello-gpt-4o/), the real-time problem has been nearly solved. By combining **real-time conversation, interruption, proactive questioning, and real-time analysis of camera content** with local knowledge bases and Agent capabilities, digital humans have reached a much higher level of usability.

**Scenarios where no physical appearance is needed (wearable devices: real-time capture, cloud processing, voice and image feedback):**

- Personal real-time assistant
- Assistant for the visually impaired
- Translation assistant
- Student learning tutor
- Others (feel free to add more)

**Scenarios requiring a physical appearance:**

- Empowering digital humans, although no current technology solves their interaction abilities, such as applying lipstick in real time or multi-angle movement under a free camera.
- Empowering physical robots, such as autonomous decision-making for rescue robots and communication with operators through natural language or specific syntax.
- Others

Currently, OpenAI **has not yet provided the voice and video APIs shown in the demos**, only offering GPT-4o's text dialogue and image recognition capabilities, which are not significantly different from previous versions such as GPT-4-Vision-Preview.

Demo video:

### 0.2 Other Implementations

tbd

## 1. Appearance Driven

### 1.1 Real Person Recording + Algorithmic Drive

A real person records the source video; AI then drives lip movements, posture, and so on to create the digital human.

- Advantages: difficult to distinguish from a real person (since real footage is recorded directly), accurate lip-syncing, works for both live streaming and pre-recorded video.
- Disadvantages: expensive (possibly)

> In this image, the right side is a digital human, and the left side is a real person.
![Digital Human 1](assets/image.png)
![Digital Human 2](assets/image-1.png)

Demo video:

---

**Related Technologies:**

- Lip Sync technology (representatives: [Wav2Lip](https://github.com/Rudrabha/Wav2Lip), [HeyGen](https://www.heygen.com/), [rask.ai](https://rask.ai/))
- Real-time video face swap (representatives: [DeepFakeLive](https://www.deepfakevfx.com/downloads/deepfacelive/), [FaceFusion](https://github.com/facefusion/facefusion), [fal.ai](https://fal.ai/models/fal-ai/fast-turbo-diffusion/playground))
- Image to video (representatives: [MuseTalk](https://github.com/TMElyralab/MuseTalk), [Sadtalker](https://github.com/OpenTalker/SadTalker)); a SadTalker invocation sketch follows this list
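A minimal sketch of photo-driven talking-head generation with [SadTalker](https://github.com/OpenTalker/SadTalker); the flags follow the project README at the time of writing and the paths are illustrative:

```python
import subprocess

# Animate a single portrait photo with a (cloned) speech track
subprocess.run(
    ["python", "inference.py",
     "--driven_audio", "input/speech.wav",    # the speech audio
     "--source_image", "input/portrait.png",  # the single photo
     "--result_dir", "results"],              # output directory
    cwd="SadTalker",  # repo root, with checkpoints downloaded per its README
    check=True,
)
```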
### 1.2 Modeling + Algorithmic Drive

Modeling offers higher freedom, with options ranging from high-precision to low-precision modeling to fit any budget; it is also possible to take an alternative route and build cartoon characters.

**Representative Technologies:**

[Meta Human](https://www.unrealengine.com/en-US/metahuman)
![Meta Human](assets/image-5.png)

[NVIDIA Omniverse Audio2Face](https://www.nvidia.com/en-us/omniverse/apps/audio2face/)
![Audio2Face](assets/image-4.png)

[Live2D](https://www.live2d.com/en/)
![Live2D](assets/image-2.png)

[Adobe Character Animator](https://www.adobe.com/hk_en/products/character-animator.html)
![Adobe Character Animator](assets/image-3.png)

## 2. Voice Imitation

**Some non-professional background knowledge:**
A digital human's voice can use an existing TTS model or a self-trained voice model. The acoustic model is an essential part of a speech synthesis system.
![Acoustic Model](https://i0.hdslb.com/bfs/article/439a654b5efa2b623d5e6cbd68ac525665ad737b.png@1256w_240h_!web-article-pic.avif)

Mainstream acoustic models include [VITS](https://github.com/jaywalnut310/vits), [Tacotron](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2), and [FastSpeech2](https://github.com/ming024/FastSpeech2). VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end speech synthesis method that jointly trains acoustic modeling and the vocoder with adversarial learning, generating waveforms directly from text.
![VITS Process](https://i0.hdslb.com/bfs/article/6fb3acf043b2842d861066653a85fff84be95af7.png@1256w_726h_!web-article-pic.avif)

The previously popular "AI Stefanie Sun" used [so-vits-svc](https://github.com/svc-develop-team/so-vits-svc/tree/4.1-Stable), short for SoftVC VITS Singing Voice Conversion, which voice enthusiasts adapted from [softVC](https://github.com/bshall/soft-vc) and [VITS](https://github.com/jaywalnut310/vits).

### Hot Projects Related to Voice Imitation (as of June 2024)

**1. GPT-SoVITS** *27.0K stars*
Voice model training project; models can be fine-tuned with a small amount of speech data; provides a WebUI.

**Project address:** <https://github.com/RVC-Boss/GPT-SoVITS>

**Demo video:**

**2. so-vits-svc** *24.4K stars*
Voice model training project; representative: AI Stefanie Sun.

**Project address:** <https://github.com/svc-develop-team/so-vits-svc>

**Demo video:**

**3. ChatTTS** *23.3K stars*
Not voice cloning, but its text-to-speech results are excellent, with pauses, intonation, and emotion, plus native Chinese support. Various one-click deployment packages for Windows, Linux, etc. are available online. A minimal usage sketch follows this list.

**Project address:** <https://github.com/2noise/ChatTTS>

**Demo video:**

**Others:** [CapCut Voice Cloning](https://www.capcut.cn/), [Reecho](https://www.reecho.ai/), [Emotional VITS](https://github.com/innnky/emotional-vits), [Bark](https://github.com/suno-ai/bark)
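A minimal ChatTTS sketch based on its README; method names have shifted between releases (e.g., `load` vs. `load_models`), so check the repo before running:

```python
import torch
import torchaudio
import ChatTTS  # installed from the ChatTTS repository

chat = ChatTTS.Chat()
chat.load(compile=False)  # compile=True starts slower but runs faster

texts = ["A digital human combines appearance, voice, and conversation."]
wavs = chat.infer(texts)  # one waveform (numpy array) per input text

torchaudio.save("chattts_demo.wav", torch.from_numpy(wavs[0]), 24000)
```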
## 3. Interaction Technology

**tbd**
*Such as multi-camera, multi-angle digital humans, real-time dressing, makeup, etc.*

## 4. Application Scenarios and Comprehensive Representative Projects

Digital humans are used in self-media (science popularization and other talking-head bloggers), e-commerce live streaming, and education. There is also exploration in areas like digital life (deceased relatives), combined with AR and VR. Additionally, the integration of digital human technology with physical robots naturally falls within this scope.

**Representative Projects:**

1. [AI-Vtuber](https://github.com/Ikaros-521/AI-Vtuber)
   [Open Source] AI Vtuber is a virtual AI host driven by large models, integrating appearance and voice.

2. [Fay](https://github.com/TheRamU/Fay)
   [Open Source] Fay is a complete open-source project, including the Fay controller and digital human models, which can be flexibly combined for different application scenarios: virtual host, live sales, product guide, voice assistant, remote voice assistant, digital human interaction, digital human interviewer and psychological evaluation, Jarvis, Her.

3. [HeyGen](https://www.heygen.com/)
   [Overseas/Chinese-founded] A popular AI video creation platform offering digital avatars, voice cloning, and other related features.
   ![HeyGen](assets/image-7.png)

4. [Tekan Technology](https://www.zhubobao.com/)
   [Domestic Commercial] High-quality digital humans based on real-person video
   ![Tekan Technology](assets/image-6.png)

5. [Tencent Zhiying](https://zenvideo.qq.com/)
   [Domestic Commercial] A comprehensive creation platform integrating various AIGC capabilities.
   ![Tencent Zhiying](assets/image-8.png)

6. [CZNK.AI](https://mp.weixin.qq.com/s/etcD4SEMznBctOjuNJty2A)
   [Domestic Commercial] A representative of AIGC course content production, providing a one-stop platform for content production and services
   ![CZNK.AI](assets/image9.png)

7. [hifly](https://hifly.cc)
   [Domestic Commercial] Offers various features, including digital avatars and voice cloning.
   ![hifly](assets/image10.png)

## 5. Laws, Regulations, and Representative News

- **[Regulations on the Management of Deep Synthesis of Internet Information Services](https://www.gov.cn/zhengce/zhengceku/2022-12/12/content_5731431.htm)**
  *Providers of deep synthesis services and technical supporters offering services that generate or significantly alter information content with intelligent dialogue, synthetic voices, facial generation, immersive realistic scenarios, etc., should prominently label them to avoid public confusion or misidentification.*

- **[Beijing Action Plan for Promoting the Innovative Development of the Digital Human Industry](https://www.beijing.gov.cn/zhengce/zhengcefagui/202208/W020220808406785112297.pdf)**
  *The Beijing Municipal Bureau of Economy and Information Technology issued the first domestic special support policy for the digital human industry: the "Beijing Action Plan for Promoting the Innovative Development of the Digital Human Industry (2022-2025)".*
- **[How are Large Models and Digital Human Technologies Applied in the Field of Education?](https://learning.sohu.com/a/713671752_120619005)**
  *"Using technology to promote education development and benefit more people is our original intention. Building more effective, efficient, and experiential education allows learners worldwide to enjoy high-quality digital educational resources."
  On August 20, at the 2023 Global Smart Education Conference, Chen Changjie, Vice President of the Smart Learning Institute of Beijing Normal University and Vice President of NetDragon, said this in a media interview.*

- **[Global Future Education Design Competition](https://gcd4fe.bnu.edu.cn/)** Project Practical Experience Instruction Document (AIGC Summer Camp & GCD4FE 48H)

## 6. The Brain of Digital Humans: Large Language Models

### The main multimodal models currently supporting image recognition and processing include

gpt-4o, gpt-4-vision-preview, gemini-pro-vision, Zhipu GLM-4V, 01.AI yi-vl-plus, Tongyi Qwen-VL-Max, LLaVA (open source), etc.

### API Application Addresses for Various Models

- Baidu

- 360

- Qwen

- Xinghuo

- Zhipu

- Moonshot AI

- Hunyuan

- Baichuan

- MiniMax

- Lingyi

- Jieyue Xingchen

- Claude

- Gemini

### Open-Source Large Model Integration Frontends

- **Open-WebUI** Provides a feature-rich WebUI, integrates various large models, has user-group functionality, and lets administrators easily manage multiple users and collect detailed usage data.
  <https://github.com/open-webui/open-webui>

- **Jan** Provides multi-platform clients, integrates various open-source and API models, and features a simple, aesthetically pleasing UI.
  <https://github.com/janhq/jan>

- **Langchain-Chatchat** Provides multi-model support plus RAG and agent functions through a WebUI.
  <https://github.com/chatchat-space/Langchain-Chatchat>

### Large Model API Integration Management Gateways

- Ollama (a local-API call sketch follows this list)
  <https://github.com/ollama/ollama>

- LiteLLM
  <https://github.com/BerriAI/litellm>

- vLLM
  <https://github.com/vllm-project/vllm>

- OneAPI
  <https://github.com/songquanpeng/one-api>

- Gateway
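A minimal sketch of calling a locally served model through Ollama's REST API; it assumes `ollama serve` is running and a model has already been pulled (the model name is illustrative):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={
        "model": "llama3",  # any model previously fetched with `ollama pull`
        "prompt": "Describe digital humans in one sentence.",
        "stream": False,    # return a single JSON object instead of a stream
    },
    timeout=120,
)
print(resp.json()["response"])
```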
### Local Knowledge Base and Intelligent Agent Construction

- FastGPT
  <https://github.com/labring/FastGPT>

- dify
  <https://github.com/langgenius/dify>

- Coze
  <https://www.coze.com>

### Large Model Automation Evaluation Tools

- OpenCompass
  <https://github.com/open-compass/opencompass>

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=YUANZHUO-BNU/metahuman_overview&type=Date)](https://star-history.com/#YUANZHUO-BNU/metahuman_overview&Date)

--------------------------------------------------------------------------------
/assets/image-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YUANZHUO-BNU/metahuman_overview/b6eb463fe4d051d9ef15bfbb8d2ee3c024a72220/assets/image-1.png

--------------------------------------------------------------------------------
/assets/image-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YUANZHUO-BNU/metahuman_overview/b6eb463fe4d051d9ef15bfbb8d2ee3c024a72220/assets/image-2.png

--------------------------------------------------------------------------------
/assets/image-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YUANZHUO-BNU/metahuman_overview/b6eb463fe4d051d9ef15bfbb8d2ee3c024a72220/assets/image-3.png

--------------------------------------------------------------------------------
/assets/image-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YUANZHUO-BNU/metahuman_overview/b6eb463fe4d051d9ef15bfbb8d2ee3c024a72220/assets/image-4.png

--------------------------------------------------------------------------------
/assets/image-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YUANZHUO-BNU/metahuman_overview/b6eb463fe4d051d9ef15bfbb8d2ee3c024a72220/assets/image-5.png

--------------------------------------------------------------------------------
/assets/image-6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YUANZHUO-BNU/metahuman_overview/b6eb463fe4d051d9ef15bfbb8d2ee3c024a72220/assets/image-6.png

--------------------------------------------------------------------------------
/assets/image-7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YUANZHUO-BNU/metahuman_overview/b6eb463fe4d051d9ef15bfbb8d2ee3c024a72220/assets/image-7.png

--------------------------------------------------------------------------------
/assets/image-8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YUANZHUO-BNU/metahuman_overview/b6eb463fe4d051d9ef15bfbb8d2ee3c024a72220/assets/image-8.png

--------------------------------------------------------------------------------
/assets/image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YUANZHUO-BNU/metahuman_overview/b6eb463fe4d051d9ef15bfbb8d2ee3c024a72220/assets/image.png

--------------------------------------------------------------------------------
/assets/image10.png:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/YUANZHUO-BNU/metahuman_overview/b6eb463fe4d051d9ef15bfbb8d2ee3c024a72220/assets/image10.png -------------------------------------------------------------------------------- /assets/image9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/YUANZHUO-BNU/metahuman_overview/b6eb463fe4d051d9ef15bfbb8d2ee3c024a72220/assets/image9.png --------------------------------------------------------------------------------