├── audio-data
│   └── README.md
├── video-data
│   └── README.md
├── preprocessed-audio-data
│   └── README.md
├── preprocessed-video-data
│   └── README.md
├── 576x576-CorrespondingVideo.jpg
├── LICENSE
├── README.md
├── video-preprocess.py
├── hparams.py
└── audio-preprocess.py

/audio-data/README.md:
--------------------------------------------------------------------------------
Place the audio here (16000 Hz).
--------------------------------------------------------------------------------
/video-data/README.md:
--------------------------------------------------------------------------------
Place the video here (25 fps).
--------------------------------------------------------------------------------
/preprocessed-audio-data/README.md:
--------------------------------------------------------------------------------
HuBERT-processed audio files go here.
--------------------------------------------------------------------------------
/preprocessed-video-data/README.md:
--------------------------------------------------------------------------------
After each frame of the video is processed, the face is cropped out at a size of 576x576.
--------------------------------------------------------------------------------
/576x576-CorrespondingVideo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/langzizhixin/wav2lip-576x576/HEAD/576x576-CorrespondingVideo.jpg
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2024 Sam Nguyen

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# wav2lip-576x576 Introduction
This is a project about talking faces. We train on 576x576 face crops, which makes it possible to generate 2K, 4K, 6K, and 8K digital-human videos.

We have optimized in the following areas:
1. We use HuBERT for audio processing, which gives a significant improvement compared to wav2lip-96 and wav2lip-288.
2. We optimized dataset processing, eliminating the need to manually cut videos into short clips (a sample data layout is sketched right after this list).
3. We optimized the network structure to extract features better. Our idea is not to train the discriminator separately, but to train the generator directly.
4. We trained the base model on a high-definition dataset of several hundred people. Although its generalization ability is not strong, the results are very good after single-person or multi-person fine-tuning.
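
As a reference for item 2, the released video-preprocess.py (shown further below) looks for `*/*.mp4` one level below the data root and writes one output folder per clip, so the expected input layout is roughly the following (the speaker and clip names are placeholders):

```
video-data/
├── speaker_A/
│   ├── clip_001.mp4   # 25 fps, with its audio track
│   └── clip_002.mp4
└── speaker_B/
    └── clip_003.mp4
```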

# wav2lip-576x576 Project Status

Video | Project Page | Code

# wav2lip-576x576 Code Release Plan
This project is not yet mature enough.
We will release the code gradually: first the data-processing code, then the inference code, and, when the time is ripe, the training code.

# Acknowledgements
The code is mainly borrowed from wav2lip, wav2lip-288, wav2lip-384, ER-NeRF, etc.
Thanks to the authors for their wonderful work.

# Author
Project by Lu Rui from Langzizhixin Technology, Chengdu, China, 2024.

# Code Contribution
At present, the video preprocessing, facial cropping, and audio HuBERT processing code has been completed. Everyone is welcome to contribute code for the network structure, training, and inference.
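
As a rough usage sketch (not an official command reference: the flags below are taken from the argparse definitions in video-preprocess.py and audio-preprocess.py, and the paths are placeholders):

```
# 1. Detect and crop the face in every frame, and extract audio.wav for each clip
python video-preprocess.py --data_root video-data --preprocessed_root preprocessed-video-data --ngpu 1 --batch_size 32

# 2. Extract HuBERT features from an extracted wav (writes <name>_hu.npy next to it)
python audio-preprocess.py --wav preprocessed-video-data/speaker_A/clip_001/audio.wav
```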
--------------------------------------------------------------------------------
/video-preprocess.py:
--------------------------------------------------------------------------------
import sys

# Require Python >= 3.2 (the original check used "and", which never triggers on Python 3)
if sys.version_info < (3, 2):
    raise Exception("Must be using >= Python 3.2")

from os import listdir, path

if not path.isfile('face_detection/detection/sfd/s3fd.pth'):
    raise FileNotFoundError('Save the s3fd model to face_detection/detection/sfd/s3fd.pth \
before running this script!')

import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor, as_completed
import numpy as np
import argparse, os, cv2, traceback, subprocess
from tqdm import tqdm
from glob import glob
# import audio  # mel-spectrogram helpers from the original wav2lip repo; not included here and not used below
from hparams import hparams as hp

import face_detection

parser = argparse.ArgumentParser()

parser.add_argument('--ngpu', help='Number of GPUs across which to run in parallel', default=1, type=int)
parser.add_argument('--batch_size', help='Single-GPU face detection batch size', default=32, type=int)
parser.add_argument("--data_root", help="Root folder of the dataset: <data_root>/<speaker>/<clip>.mp4", required=True)
parser.add_argument("--preprocessed_root", help="Root folder of the preprocessed dataset", required=True)

args = parser.parse_args()

# One s3fd face detector per GPU
fa = [face_detection.FaceAlignment(face_detection.LandmarksType._2D, flip_input=False,
                                   device='cuda:{}'.format(id)) for id in range(args.ngpu)]

# ffmpeg command used to extract the audio track of each clip
template = 'ffmpeg -loglevel panic -y -i {} -strict -2 {}'
# template2 = 'ffmpeg -hide_banner -loglevel panic -threads 1 -y -i {} -async 1 -ac 1 -vn -acodec pcm_s16le -ar 16000 {}'

def process_video_file(vfile, args, gpu_id):
    # Read the whole clip into memory
    video_stream = cv2.VideoCapture(vfile)

    frames = []
    while 1:
        still_reading, frame = video_stream.read()
        if not still_reading:
            video_stream.release()
            break
        frames.append(frame)

    vidname = os.path.basename(vfile).split('.')[0]
    dirname = vfile.split('/')[-2]

    fulldir = path.join(args.preprocessed_root, dirname, vidname)
    os.makedirs(fulldir, exist_ok=True)

    batches = [frames[i:i + args.batch_size] for i in range(0, len(frames), args.batch_size)]

    # Detect the face in every frame and save the crop as <frame_index>.jpg
    i = -1
    for fb in batches:
        preds = fa[gpu_id].get_detections_for_batch(np.asarray(fb))

        for j, f in enumerate(preds):
            i += 1
            if f is None:
                continue

            x1, y1, x2, y2 = f
            cv2.imwrite(path.join(fulldir, '{}.jpg'.format(i)), fb[j][y1:y2, x1:x2])

def process_audio_file(vfile, args):
    vidname = os.path.basename(vfile).split('.')[0]
    dirname = vfile.split('/')[-2]

    fulldir = path.join(args.preprocessed_root, dirname, vidname)
    os.makedirs(fulldir, exist_ok=True)

    wavpath = path.join(fulldir, 'audio.wav')

    command = template.format(vfile, wavpath)
    subprocess.call(command, shell=True)


def mp_handler(job):
    vfile, args, gpu_id = job
    try:
        process_video_file(vfile, args, gpu_id)
    except KeyboardInterrupt:
        exit(0)
    except:
        traceback.print_exc()

def main(args):
    print('Started processing for {} with {} GPUs'.format(args.data_root, args.ngpu))

    filelist = glob(path.join(args.data_root, '*/*.mp4'))

    jobs = [(vfile, args, i % args.ngpu) for i, vfile in enumerate(filelist)]
    p = ThreadPoolExecutor(args.ngpu)
    futures = [p.submit(mp_handler, j) for j in jobs]
    _ = [r.result() for r in tqdm(as_completed(futures), total=len(futures))]

    print('Dumping audios...')

    for vfile in tqdm(filelist):
        try:
            process_audio_file(vfile, args)
        except KeyboardInterrupt:
            exit(0)
        except:
            traceback.print_exc()
            continue

if __name__ == '__main__':
    main(args)
--------------------------------------------------------------------------------
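
Note that process_video_file above saves the raw s3fd bounding-box crop, whose size varies from frame to frame, while the preprocessed-video-data README and hparams.img_size describe 576x576 faces. If fixed-size crops are wanted at this stage, a resize step along the following lines could be added; this is only a sketch, and the helper name is mine rather than part of the released code:

```python
import cv2
from hparams import hparams as hp

def save_face_crop(frame, bbox, out_path, size=hp.img_size):
    """Crop the detected face and resize it to a square of `size` pixels (576 by default)."""
    x1, y1, x2, y2 = bbox
    face = frame[y1:y2, x1:x2]
    face = cv2.resize(face, (size, size), interpolation=cv2.INTER_CUBIC)
    cv2.imwrite(out_path, face)

# Inside process_video_file, the cv2.imwrite call could then become:
# save_face_crop(fb[j], f, path.join(fulldir, '{}.jpg'.format(i)))
```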
/hparams.py:
--------------------------------------------------------------------------------
from glob import glob
import os

def get_image_list(file_list_txt):
    # Read a filelist txt: one sample path per line (anything after the first space is ignored)
    filelist = []
    with open(file_list_txt) as f:
        for line in f:
            line = line.strip()
            if ' ' in line: line = line.split()[0]
            filelist.append(line)

    return filelist

class HParams:
    def __init__(self, **kwargs):
        self.data = {}

        for key, value in kwargs.items():
            self.data[key] = value

    def __getattr__(self, key):
        if key not in self.data:
            raise AttributeError("'HParams' object has no attribute %s" % key)
        return self.data[key]

    def set_hparam(self, key, value):
        self.data[key] = value

    def values(self):
        # Used by hparams_debug_string() below
        return self.data


# Default hyperparameters
hparams = HParams(
    num_mels=80,  # Number of mel-spectrogram channels and local conditioning dimensionality
    # network
    rescale=True,  # Whether to rescale audio prior to preprocessing
    rescaling_max=0.9,  # Rescaling value

    # Use LWS (https://github.com/Jonathan-LeRoux/lws) for STFT and phase reconstruction
    # It's preferred to set True to use with https://github.com/r9y9/wavenet_vocoder
    # Does not work if n_fft is not a multiple of hop_size!!
    use_lws=False,

    n_fft=800,  # Extra window size is filled with 0 paddings to match this parameter
    hop_size=200,  # For 16000 Hz, 200 = 12.5 ms (0.0125 * sample_rate)
    win_size=800,  # For 16000 Hz, 800 = 50 ms (If None, win_size = n_fft) (0.05 * sample_rate)
    sample_rate=16000,  # 16000 Hz (corresponding to librispeech) (sox --i <filename>)

    frame_shift_ms=None,  # Can replace the hop_size parameter. (Recommended: 12.5)

    # Mel and Linear spectrograms normalization/scaling and clipping
    signal_normalization=True,
    # Whether to normalize mel spectrograms to some predefined range (following below parameters)
    allow_clipping_in_normalization=True,  # Only relevant if mel_normalization = True
    symmetric_mels=True,
    # Whether to scale the data to be symmetric around 0. (Also multiplies the output range by 2,
    # faster and cleaner convergence)
    max_abs_value=4.,
    # max absolute value of data. If symmetric, data will be [-max, max] else [0, max] (Must not
    # be too big to avoid gradient explosion,
    # not too small for fast convergence)
    # Contribution by @begeekmyfriend
    # Spectrogram Pre-Emphasis (Lfilter: Reduces spectrogram noise and helps model certitude
    # levels. Also allows for better G&L phase reconstruction)
    preemphasize=True,  # whether to apply filter
    preemphasis=0.97,  # filter coefficient.

    # Limits
    min_level_db=-100,
    ref_level_db=20,
    fmin=55,
    # Set this to 55 if your speaker is male! If female, 95 should help taking off noise. (To
    # test depending on dataset. Pitch info: male~[65, 260], female~[100, 525])
    fmax=7600,  # To be increased/reduced depending on data.

    ###################### Our training parameters #################################
    img_size=576,
    fps=25,

    batch_size=16,
    initial_learning_rate=1e-4,
    nepochs=200000000000000000,  ### ctrl + c: stop whenever eval loss is consistently greater than train loss for ~10 epochs
    num_workers=16,
    checkpoint_interval=3000,
    eval_interval=3000,
    save_optimizer_state=True,

    syncnet_wt=0.0,  # initially zero; will be set automatically to 0.03 later. Leads to faster convergence.
    syncnet_batch_size=64,
    syncnet_lr=1e-4,
    syncnet_eval_interval=10000,
    syncnet_checkpoint_interval=10000,

    disc_wt=0.07,
    disc_initial_learning_rate=1e-4,
)


def hparams_debug_string():
    values = hparams.values()
    hp = ["  %s: %s" % (name, values[name]) for name in sorted(values) if name != "sentences"]
    return "Hyperparameters:\n" + "\n".join(hp)
--------------------------------------------------------------------------------
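
A quick sanity check on how these defaults line up with 25 fps video and with the HuBERT features extracted by audio-preprocess.py below (my own arithmetic, not something stated in the repo):

```python
from hparams import hparams as hp

mel_frames_per_sec = hp.sample_rate / hp.hop_size               # 16000 / 200 = 80.0 mel frames per second
mel_frames_per_video_frame = mel_frames_per_sec / hp.fps        # 80 / 25 = 3.2 mel frames per video frame

# HuBERT (see audio-preprocess.py) produces one feature vector every 320 samples at 16 kHz:
hubert_frames_per_sec = hp.sample_rate / 320                    # 50.0
hubert_frames_per_video_frame = hubert_frames_per_sec / hp.fps  # 50 / 25 = 2.0

print(mel_frames_per_video_frame, hubert_frames_per_video_frame)
```

The factor of two is presumably why audio-preprocess.py reshapes its output to (-1, 2, 1024): two HuBERT steps per video frame.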
/audio-preprocess.py:
--------------------------------------------------------------------------------
from transformers import Wav2Vec2Processor, HubertModel
import soundfile as sf
import numpy as np
import torch
from argparse import ArgumentParser
import librosa

print("Loading the Wav2Vec2 Processor...")
wav2vec2_processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
print("Loading the HuBERT Model...")
hubert_model = HubertModel.from_pretrained("facebook/hubert-large-ls960-ft")


def get_hubert_from_16k_wav(wav_16k_name):
    speech_16k, _ = sf.read(wav_16k_name)
    hubert = get_hubert_from_16k_speech(speech_16k)
    return hubert

@torch.no_grad()
def get_hubert_from_16k_speech(speech, device="cuda:0"):
    global hubert_model
    hubert_model = hubert_model.to(device)
    if speech.ndim == 2:
        speech = speech[:, 0]  # [T, 2] ==> [T,], keep only the first channel
    input_values_all = wav2vec2_processor(speech, return_tensors="pt", sampling_rate=16000).input_values  # [1, T]
    input_values_all = input_values_all.to(device)
    # For long audio sequences we cannot process everything in one run because of memory limits.
    # HuBERT processes the wav with a CNN of strides [5,2,2,2,2,2,2], giving a total stride of 320 samples,
    # and kernels [10,3,3,3,3,2,2], so 400 samples is the fundamental unit for one output time step.
    # The CNN is therefore equivalent to one big Conv1D with kernel k=400 and stride s=320.
    # The number of output time steps is T = floor((t - k) / s) + 1, i.e. (t - (k - s)) // s as computed below.
    # Each clip is given a length of (k + s*(N-1)) samples, where N is the expected number of output steps
    # for that clip; the start of the next clip rolls back by (kernel - stride) samples, so consecutive
    # clip starts are stride * N apart and the clips concatenate without overlap.
    kernel = 400
    stride = 320
    clip_length = stride * 1000
    num_iter = input_values_all.shape[1] // clip_length
    expected_T = (input_values_all.shape[1] - (kernel - stride)) // stride
    res_lst = []
    for i in range(num_iter):
        if i == 0:
            start_idx = 0
            end_idx = clip_length - stride + kernel
        else:
            start_idx = clip_length * i
            end_idx = start_idx + (clip_length - stride + kernel)
        input_values = input_values_all[:, start_idx: end_idx]
        hidden_states = hubert_model.forward(input_values).last_hidden_state  # [B=1, T=pts//320, hid=1024]
        res_lst.append(hidden_states[0])
    if num_iter > 0:
        input_values = input_values_all[:, clip_length * num_iter:]
    else:
        input_values = input_values_all
    # if input_values.shape[1] != 0:
    if input_values.shape[1] >= kernel:  # if the last batch is shorter than kernel_size, skip it
        hidden_states = hubert_model(input_values).last_hidden_state  # [B=1, T=pts//320, hid=1024]
        res_lst.append(hidden_states[0])
    ret = torch.cat(res_lst, dim=0).cpu()  # [T, 1024]
    # assert ret.shape[0] == expected_T
    assert abs(ret.shape[0] - expected_T) <= 1
    if ret.shape[0] < expected_T:
        ret = torch.nn.functional.pad(ret, (0, 0, 0, expected_T - ret.shape[0]))
    else:
        ret = ret[:expected_T]
    return ret

def make_even_first_dim(tensor):
    # Drop the last time step if the length is odd, so the features can be grouped in pairs
    size = list(tensor.size())
    if size[0] % 2 == 1:
        size[0] -= 1
        return tensor[:size[0]]
    return tensor


parser = ArgumentParser()
parser.add_argument('--wav', type=str, help='path to the input wav (any sample rate; it is resampled to 16 kHz)')
args = parser.parse_args()

wav_name = args.wav

speech, sr = sf.read(wav_name)
speech_16k = librosa.resample(speech, orig_sr=sr, target_sr=16000)
print("SR: {} to {}".format(sr, 16000))
# print(speech.shape, speech_16k.shape)

hubert_hidden = get_hubert_from_16k_speech(speech_16k)
hubert_hidden = make_even_first_dim(hubert_hidden).reshape(-1, 2, 1024)
np.save(wav_name.replace('.wav', '_hu.npy'), hubert_hidden.detach().numpy())
print(hubert_hidden.detach().numpy().shape)
--------------------------------------------------------------------------------
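
To illustrate what the saved feature file contains, here is a minimal consumer sketch; the file name follows the `_hu.npy` naming above, and the per-video-frame pairing is my reading of the reshape in audio-preprocess.py rather than something the repo documents:

```python
import numpy as np

feats = np.load("audio_hu.npy")   # produced by audio-preprocess.py from audio.wav
print(feats.shape)                # (num_video_frames, 2, 1024): two HuBERT steps per 25 fps frame

# HuBERT features aligned with video frame i (0-based):
i = 0
frame_feat = feats[i]             # shape (2, 1024)
```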