├── README.md ├── Talking_to_AI_Generated_People.ipynb └── res └── combined.gif /README.md: -------------------------------------------------------------------------------- 1 | # Talking To AI-Generated People 2 | ### [[Project Video](https://www.youtube.com/watch?v=OCdikmAoLKA)] [[Code Tutorial](https://medium.com/@chintan.t93/how-to-create-fake-talking-head-videos-with-deep-learning-code-tutorial-6d82c315529d)] 3 | ### Fake Faces, Script, Voice and Lip-Sync Animation with Deep Learning 4 | This notebook combines different state-of-the-art image and speech generation neural networks into one single Google Colab Notebook so that we can generate a random fake person's talking head video replying to our input text question. 5 | 6 | ![Fake People](res/combined.gif) 7 | 8 | #### Different Tools/Repositories used:- 9 | 1) Face Generation - www.thispersondoesnotexist.com - StyleGAN2 10 | 2) Text Generation - www.textsynth.org - OpenAI GPT-2 11 | 3) Text-to-Speech Conversion - https://github.com/NVIDIA/flowtron - Flowtron 12 | 4) Lip Animation - https://github.com/Rudrabha/LipGAN - LipGAN 13 | 14 | #### TODO Improvements (Any Volunteers??) :- 15 | 1) Use motion model to animate the face before performing lip-sync. 16 | 2) Use the newer GPT-3 model for better, more coherent text responses. 17 | -------------------------------------------------------------------------------- /Talking_to_AI_Generated_People.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Talking to AI Generated People", 7 | "provenance": [], 8 | "private_outputs": true, 9 | "collapsed_sections": [], 10 | "toc_visible": true 11 | }, 12 | "kernelspec": { 13 | "name": "python3", 14 | "display_name": "Python 3" 15 | }, 16 | "accelerator": "GPU" 17 | }, 18 | "cells": [ 19 | { 20 | "cell_type": "markdown", 21 | "metadata": { 22 | "id": "QkDQkDN69jaR", 23 | "colab_type": "text" 24 | }, 25 | "source": [ 26 | "# Talking To AI-Generated People | Fake Faces, Script, Voice and Lip-Sync Animation\n", 27 | "## I combined different state-of-the-art image and speech generation neural networks into one single Google Colab Notebook so that we can generate a random fake person's talking head video replying to our text prompt input.\n", 28 | "\n", 29 | "### To run, simply connect to a GPU-instance from the menu Runtime->Change runtime type. Then press Run All under Runtime menu. Text prompt will appear at the bottom of this page (running first time might take upto 10 minutes in setup/installation).\n", 30 | "\n", 31 | "\n", 32 | "#### Credits for different Tools/Repositories used:-\n", 33 | "1) Face Generation - www.thispersondoesnotexist.com - Nvidia StyleGAN2\n", 34 | "2) Text Generation - www.textsynth.org - OpenAI GPT-2\n", 35 | "3) Speech-to-Text Conversion - https://github.com/NVIDIA/flowtron - Flowtron\n", 36 | "4) Lip Animation - https://github.com/Rudrabha/LipGAN - LipGAN\n", 37 | "\n", 38 | "\n", 39 | "#### TODO improvements (any volunteers??) :-\n", 40 | "1) Use motion model to animate the face before performing lip-sync.\n", 41 | "2) Use the newer GPT-3 model for better, more coherent text responses.\n" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": { 47 | "id": "a0bOEB3lB28O", 48 | "colab_type": "text" 49 | }, 50 | "source": [ 51 | "# Step 1: Get an image of a fake person from This-Person-Does-Not-Exist\n", 52 | "Install selenium and chromium webdriver dependencies" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "metadata": { 58 | "id": "n4NAXZFEciiz", 59 | "colab_type": "code", 60 | "colab": {} 61 | }, 62 | "source": [ 63 | "!rm -r sample_data\n", 64 | "!pip install selenium\n", 65 | "!apt-get update # to update ubuntu to correctly run apt install\n", 66 | "!apt install chromium-chromedriver\n", 67 | "!cp /usr/lib/chromium-browser/chromedriver /usr/bin\n", 68 | "import sys\n", 69 | "sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')\n", 70 | "from selenium import webdriver\n", 71 | "chrome_options = webdriver.ChromeOptions()\n", 72 | "chrome_options.add_argument('--headless')\n", 73 | "chrome_options.add_argument('--no-sandbox')\n", 74 | "chrome_options.add_argument('--disable-dev-shm-usage')\n", 75 | "driver = webdriver.Chrome('chromedriver',chrome_options=chrome_options)" 76 | ], 77 | "execution_count": null, 78 | "outputs": [] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": { 83 | "id": "jQSyx14qgtWn", 84 | "colab_type": "text" 85 | }, 86 | "source": [ 87 | "Download a fake person face from https://thispersondoesnotexist.com/ using the following code. If you want a different face, rerun this code cell until you like one. \n", 88 | "\n", 89 | "Note that the current speech generation model only outputs a female voice, so you may want to pick the faces appropriately. " 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "metadata": { 95 | "id": "A-BbCul-dUr7", 96 | "colab_type": "code", 97 | "colab": {} 98 | }, 99 | "source": [ 100 | "from selenium.webdriver.common.action_chains import ActionChains\n", 101 | "driver.get(\"https://thispersondoesnotexist.com/\")\n", 102 | "import time \n", 103 | "time.sleep(5)\n", 104 | "button = driver.find_element_by_id('saveButton')\n", 105 | "ActionChains(driver).move_to_element(button).click(button).perform()\n", 106 | "time.sleep(4)\n", 107 | "from IPython.display import Image\n", 108 | "Image('person.jpg')" 109 | ], 110 | "execution_count": null, 111 | "outputs": [] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": { 116 | "id": "PVW2WZgIhqpX", 117 | "colab_type": "text" 118 | }, 119 | "source": [ 120 | "# Step 2: Generate response script with Text Synth" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "metadata": { 126 | "id": "7sbsT721iBxN", 127 | "colab_type": "code", 128 | "colab": {} 129 | }, 130 | "source": [ 131 | "# prompt = input(\"Ask this person a question: \")\n", 132 | "prompt = 'Hi there, do you know what the time is?'\n", 133 | "\n", 134 | "from selenium.webdriver.common.keys import Keys\n", 135 | "\n", 136 | "driver.get(\"http://textsynth.org/\")\n", 137 | "driver.implicitly_wait(10)\n", 138 | "inputElement = driver.find_element_by_id('input_text')\n", 139 | "inputElement.click()\n", 140 | "inputElement.clear()\n", 141 | "inputElement.send_keys(prompt)\n", 142 | "button = driver.find_element_by_id('submit_button')\n", 143 | "ActionChains(driver).move_to_element(button).click(button).perform()\n", 144 | "time.sleep(10)\n", 145 | "responseElement = driver.find_element_by_id('gtext')\n", 146 | "response = responseElement.text\n", 147 | "response = response[len(prompt):].replace('\\n', ' ')\n", 148 | "print(response)" 149 | ], 150 | "execution_count": null, 151 | "outputs": [] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": { 156 | "id": "ZCFYHgoVeF74", 157 | "colab_type": "text" 158 | }, 159 | "source": [ 160 | "# Step 3: Convert response text to speech with FlowTron\n", 161 | "First, clone the Flowtron Repository and install the requirements (this may take upto 3-4 minutes)" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "metadata": { 167 | "id": "p-p4YSxf-MtA", 168 | "colab_type": "code", 169 | "colab": {} 170 | }, 171 | "source": [ 172 | "!git clone https://github.com/NVIDIA/flowtron.git\n", 173 | "%cd flowtron\n", 174 | "!git submodule update --init\n", 175 | "%cd tacotron2\n", 176 | "!git submodule update --init\n", 177 | "%cd .." 178 | ], 179 | "execution_count": null, 180 | "outputs": [] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "metadata": { 185 | "id": "ozbW0Ec1GiXr", 186 | "colab_type": "code", 187 | "colab": {} 188 | }, 189 | "source": [ 190 | "!pip install virtualenv\n", 191 | "!virtualenv flowtronenv" 192 | ], 193 | "execution_count": null, 194 | "outputs": [] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "metadata": { 199 | "id": "DEy6SnhWGkD1", 200 | "colab_type": "code", 201 | "colab": {} 202 | }, 203 | "source": [ 204 | "!source flowtronenv/bin/activate; pip install numpy==1.16.4 inflect==0.2.5 librosa==0.6.0 scipy==1.0.0 tensorboardX==1.1 Unidecode==1.0.22 pillow matplotlib numba==0.48; pip install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html" 205 | ], 206 | "execution_count": null, 207 | "outputs": [] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": { 212 | "id": "QowB38xOC5Op", 213 | "colab_type": "text" 214 | }, 215 | "source": [ 216 | "Download Pre-Trained Models" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "metadata": { 222 | "id": "zPSMBn05C4lL", 223 | "colab_type": "code", 224 | "colab": {} 225 | }, 226 | "source": [ 227 | "!wget -N -q https://raw.githubusercontent.com/yhgon/colab_utils/master/gfile.py\n", 228 | "!mkdir models\n", 229 | "!python gfile.py -u 'https://drive.google.com/open?id=1KhJcPawFgmfvwV7tQAOeC253rYstLrs8' -f 'models/flowtron_libritts.pt'\n", 230 | "!python gfile.py -u 'https://drive.google.com/open?id=1Cjd6dK_eFz6DE0PKXKgKxrzTUqzzUDW-' -f 'models/flowtron_ljs.pt'\n", 231 | "!python gfile.py -u 'https://drive.google.com/open?id=1Rm5rV5XaWWiUbIpg5385l5sh68z2bVOE' -f 'models/waveglow_256channels_v4.pt'" 232 | ], 233 | "execution_count": null, 234 | "outputs": [] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": { 239 | "id": "xo2gRAKtB5sX", 240 | "colab_type": "text" 241 | }, 242 | "source": [ 243 | "Inference Demo" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "metadata": { 249 | "id": "1xlclL4p_dKs", 250 | "colab_type": "code", 251 | "colab": {} 252 | }, 253 | "source": [ 254 | "%cd /content/flowtron\n", 255 | "tts_text = response.replace('\\n',' ').replace('\"','')\n", 256 | "print(tts_text)\n", 257 | "!source flowtronenv/bin/activate; python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_v4.pt -t \"$tts_text\" -i 0\n", 258 | "\n", 259 | "!cp './results/sid0_sigma0.5.wav' ./..\n", 260 | "%cd ..\n", 261 | "!mv './sid0_sigma0.5.wav' './speech.wav'\n", 262 | "\n", 263 | "from IPython.display import Audio\n", 264 | "sound_file = './results/sid0_sigma0.5.wav'\n", 265 | "Audio(sound_file, autoplay=True)" 266 | ], 267 | "execution_count": null, 268 | "outputs": [] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": { 273 | "id": "NqnW4IyZ-xj9", 274 | "colab_type": "text" 275 | }, 276 | "source": [ 277 | "# Step 4: Create talking head video with LipGAN" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "metadata": { 283 | "id": "487SbKJa-6IW", 284 | "colab_type": "code", 285 | "colab": {} 286 | }, 287 | "source": [ 288 | "%cd /content\n", 289 | "!git clone https://github.com/Rudrabha/LipGAN.git --branch fully_pythonic --single-branch\n", 290 | "%cd LipGAN" 291 | ], 292 | "execution_count": null, 293 | "outputs": [] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "metadata": { 298 | "id": "Zc9UhSTfHDKh", 299 | "colab_type": "code", 300 | "colab": {} 301 | }, 302 | "source": [ 303 | "!pip install git+https://www.github.com/keras-team/keras-contrib.git; pip uninstall -y tensorflow tensorflow-gpu; pip install -U numpy; pip install tensorflow-gpu==1.14.0; pip install -U scipy" 304 | ], 305 | "execution_count": null, 306 | "outputs": [] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": { 311 | "id": "CgNUtCp5YIY2", 312 | "colab_type": "text" 313 | }, 314 | "source": [ 315 | "Download the pre-trained LipGAN model and the Face Detector file" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "metadata": { 321 | "id": "jPFxZlxZATxD", 322 | "colab_type": "code", 323 | "colab": {} 324 | }, 325 | "source": [ 326 | "!wget -N -q https://raw.githubusercontent.com/yhgon/colab_utils/master/gfile.py\n", 327 | "!python gfile.py -u 'https://drive.google.com/open?id=1DtXY5Ei_V6QjrLwfe7YDrmbSCDu6iru1' -f './logs/lipgan_residual_mel.h5'\n", 328 | "!wget 'http://dlib.net/files/mmod_human_face_detector.dat.bz2' -P './logs/'\n", 329 | "!bunzip2 './logs/mmod_human_face_detector.dat.bz2'" 330 | ], 331 | "execution_count": null, 332 | "outputs": [] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "metadata": { 337 | "id": "ZZrUOuQYCOKa", 338 | "colab_type": "code", 339 | "colab": {} 340 | }, 341 | "source": [ 342 | "%cd /content/LipGAN\n", 343 | "!python batch_inference.py --checkpoint_path logs/lipgan_residual_mel.h5 --model residual --face \"/content/person.jpg\" --audio /content/speech.wav --results_dir /content\n", 344 | "\n", 345 | "!ffmpeg -i /content/result_voice.avi /content/result_voice.mp4\n", 346 | "from IPython.display import HTML\n", 347 | "from base64 import b64encode\n", 348 | "mp4 = open('/content/result_voice.mp4','rb').read()\n", 349 | "data_url = \"data:video/mp4;base64,\" + b64encode(mp4).decode()\n", 350 | "HTML(\"\"\"\"\"\" % data_url)" 351 | ], 352 | "execution_count": null, 353 | "outputs": [] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": { 358 | "id": "O9fiUVDwvR75", 359 | "colab_type": "text" 360 | }, 361 | "source": [ 362 | "# Now try it out yourself\n", 363 | "Execute the following code cell and this time insert the question yourself in the text prompt. Save the previous results before running again as they will be overridden." 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "metadata": { 369 | "id": "wprCp94ovbTH", 370 | "colab_type": "code", 371 | "colab": {} 372 | }, 373 | "source": [ 374 | "%cd /content/\n", 375 | "!rm speech.wav person.jpg result.avi result_voice.avi result_voice.mp4\n", 376 | "!cp /usr/lib/chromium-browser/chromedriver /usr/bin\n", 377 | "import sys\n", 378 | "sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')\n", 379 | "from selenium import webdriver\n", 380 | "chrome_options = webdriver.ChromeOptions()\n", 381 | "chrome_options.add_argument('--headless')\n", 382 | "chrome_options.add_argument('--no-sandbox')\n", 383 | "chrome_options.add_argument('--disable-dev-shm-usage')\n", 384 | "driver = webdriver.Chrome('chromedriver',chrome_options=chrome_options)\n", 385 | "driver.get(\"https://thispersondoesnotexist.com/\")\n", 386 | "import time\n", 387 | "time.sleep(5)\n", 388 | "button = driver.find_element_by_id('saveButton')\n", 389 | "from selenium.webdriver.common.action_chains import ActionChains\n", 390 | "ActionChains(driver).move_to_element(button).click(button).perform()\n", 391 | "time.sleep(4)\n", 392 | "\n", 393 | "from PIL import Image, ImageOps\n", 394 | "original_image = Image.open(\"person.jpg\")\n", 395 | "size = (256,256)\n", 396 | "resized_image = ImageOps.fit(original_image, size, Image.ANTIALIAS)\n", 397 | "image = resized_image.convert('RGB')\n", 398 | "image.save(\"person.jpg\")\n", 399 | "\n", 400 | "prompt = input(\"Ask a question: \")\n", 401 | "\n", 402 | "driver.get(\"http://textsynth.org/\")\n", 403 | "driver.implicitly_wait(10)\n", 404 | "inputElement = driver.find_element_by_id('input_text')\n", 405 | "inputElement.click()\n", 406 | "inputElement.clear()\n", 407 | "inputElement.send_keys(prompt)\n", 408 | "button = driver.find_element_by_id('submit_button')\n", 409 | "ActionChains(driver).move_to_element(button).click(button).perform()\n", 410 | "time.sleep(10)\n", 411 | "responseElement = driver.find_element_by_id('gtext')\n", 412 | "response = responseElement.text\n", 413 | "response = response[len(prompt):].replace('\\n', ' ')\n", 414 | "\n", 415 | "%cd /content/flowtron\n", 416 | "tts_text = response.replace('\\n',' ').replace('\"','')\n", 417 | "!source flowtronenv/bin/activate; python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_v4.pt -t \"$tts_text\" -i 0\n", 418 | "\n", 419 | "!cp './results/sid0_sigma0.5.wav' ./..\n", 420 | "%cd ..\n", 421 | "!mv './sid0_sigma0.5.wav' './speech.wav'\n", 422 | "\n", 423 | "%cd /content/LipGAN\n", 424 | "!python batch_inference.py --checkpoint_path logs/lipgan_residual_mel.h5 --model residual --face \"/content/person.jpg\" --audio /content/speech.wav --results_dir /content\n", 425 | "\n", 426 | "!ffmpeg -i /content/result_voice.avi /content/result_voice.mp4\n", 427 | "from IPython.display import HTML\n", 428 | "from base64 import b64encode\n", 429 | "mp4 = open('/content/result_voice.mp4','rb').read()\n", 430 | "data_url = \"data:video/mp4;base64,\" + b64encode(mp4).decode()\n", 431 | "HTML(\"\"\"\"\"\" % data_url)" 432 | ], 433 | "execution_count": null, 434 | "outputs": [] 435 | }, 436 | { 437 | "cell_type": "markdown", 438 | "metadata": { 439 | "id": "OZF1fhS36wxM", 440 | "colab_type": "text" 441 | }, 442 | "source": [ 443 | "# Experimental Code (ignore)\n", 444 | "Details: split the input text according to Flowtron's token length to avoid dropping of audio sequences." 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "metadata": { 450 | "id": "fxsDrwwJ61C7", 451 | "colab_type": "code", 452 | "colab": {} 453 | }, 454 | "source": [ 455 | "%cd /content/flowtron/\n", 456 | "import librosa\n", 457 | "import numpy as np\n", 458 | "import textwrap\n", 459 | "\n", 460 | "response = 'As a health expert, I predict that the pandemic will happen in the 2020s, and it is possible that it may happen in the 2050s. What I think will happen is that pandemic will cause an increase in deaths from diseases that have been eradicated by vaccines or by conventional medicine. Will there still be any deaths from the new viral pathogens? Yes, we will still have many deaths from the new viruses.'\n", 461 | "tts_text = response.replace('\\n',' ').replace('\\'','').replace('\\\"','')\n", 462 | "token_limit = 80\n", 463 | "tts_text_wrap = textwrap.wrap(tts_text, token_limit)\n", 464 | "\n", 465 | "for it, tts_input in enumerate(tts_text_wrap):\n", 466 | " print(tts_input)\n", 467 | " \n", 468 | " # TODO: Of course we want to load the model only once for multiple inferences, but the virtualenv session gets \n", 469 | " # deactivated after every line in Colab, so developing a workaround for that would require some work.\n", 470 | " !source flowtronenv/bin/activate; python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_v4.pt -t \"$tts_input\" -i 0\n", 471 | " !mv './results/sid0_sigma0.5.wav' './results/speech{it}.wav'\n", 472 | " # if it: \n", 473 | " # x, sr = librosa.load('./results/speech.wav')\n", 474 | " # y, sr = librosa.load('./results/sid0_sigma0.5.wav')\n", 475 | " # z = np.append(x,y)\n", 476 | " # librosa.output.write_wav('./results/speech.wav', z, sr)\n", 477 | " # else:\n", 478 | " # !mv './results/sid0_sigma0.5.wav' './results/speech.wav'" 479 | ], 480 | "execution_count": null, 481 | "outputs": [] 482 | } 483 | ] 484 | } -------------------------------------------------------------------------------- /res/combined.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ChintanTrivedi/ask-fake-ai-karen/993a73f39bd1d56e64ed93f894fc1aced7915111/res/combined.gif --------------------------------------------------------------------------------