├── audio-data
│   └── README.md
├── video-data
│   └── README.md
├── preprocessed-audio-data
│   └── README.md
├── preprocessed-video-data
│   └── README.md
├── 576x576-CorrespondingVideo.jpg
├── LICENSE
├── README.md
├── video-preprocess.py
├── hparams.py
└── audio-preprocess.py
/audio-data/README.md:
--------------------------------------------------------------------------------
1 | Place audio here, at 16000 Hz.
2 |
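3 | audio-preprocess.py resamples input wavs to 16 kHz itself (via librosa), but if you prefer to store 16 kHz mono files directly, a minimal hypothetical helper could be:
4 |
5 | ```python
6 | import librosa
7 | import soundfile as sf
8 |
9 | def to_16k_wav(src_path, dst_path):
10 |     # Load with resampling to 16 kHz mono, then write it back out as a wav.
11 |     speech, _ = librosa.load(src_path, sr=16000, mono=True)
12 |     sf.write(dst_path, speech, 16000)
13 |
14 | # e.g. to_16k_wav("raw/clip.mp3", "audio-data/clip.wav")  # hypothetical paths
15 | ```
16 |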
--------------------------------------------------------------------------------
/video-data/README.md:
--------------------------------------------------------------------------------
1 | Place the videos here, at 25 fps.
2 |
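3 | video-preprocess.py reads the frames as they are, so videos should already be at 25 fps (hparams.fps). A minimal hypothetical helper, mirroring the ffmpeg/subprocess pattern used in video-preprocess.py (ffmpeg must be on PATH):
4 |
5 | ```python
6 | import subprocess
7 |
8 | def to_25fps(src_path, dst_path):
9 |     # Re-encode the source video at 25 fps.
10 |     command = 'ffmpeg -y -i {} -r 25 {}'.format(src_path, dst_path)
11 |     subprocess.call(command, shell=True)
12 |
13 | # e.g. to_25fps("raw/talk.mov", "video-data/speaker1/talk.mp4")  # hypothetical paths
14 | ```
15 |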
--------------------------------------------------------------------------------
/preprocessed-audio-data/README.md:
--------------------------------------------------------------------------------
1 | HuBERT-processed audio feature files.
2 |
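3 | A quick sanity check (hypothetical file name): audio-preprocess.py saves the HuBERT features next to the wav as `<name>_hu.npy` with shape `[T//2, 2, 1024]`.
4 |
5 | ```python
6 | import numpy as np
7 |
8 | feats = np.load("preprocessed-audio-data/example_hu.npy")  # hypothetical path
9 | print(feats.shape)  # (num_frame_pairs, 2, 1024)
10 | ```
11 |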
--------------------------------------------------------------------------------
/preprocessed-video-data/README.md:
--------------------------------------------------------------------------------
1 | Each video frame is processed and the face is cropped out at a size of 576x576.
2 |
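3 | video-preprocess.py writes the raw face crops; a hypothetical post-processing step to bring each crop to exactly 576x576 (hparams.img_size) could look like:
4 |
5 | ```python
6 | import cv2
7 | from hparams import hparams as hp
8 |
9 | def resize_face(face_bgr):
10 |     # Resize a detected face crop to img_size x img_size (576x576 with the default hparams).
11 |     return cv2.resize(face_bgr, (hp.img_size, hp.img_size))
12 | ```
13 |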
--------------------------------------------------------------------------------
/576x576-CorrespondingVideo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/langzizhixin/wav2lip-576x576/HEAD/576x576-CorrespondingVideo.jpg
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2024 Sam Nguyen
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # wav2lip-576x576 Introduction
2 | This is a talking-face project. We train on 576x576 facial images, which makes it possible to generate 2K, 4K, 6K, and 8K digital-human videos.
3 |
4 | We have optimized in the following areas:
5 | 1. We use HuBERT for audio processing, which gives a significant improvement over wav2lip-96 and wav2lip-288.
6 | 2. We optimized dataset processing, eliminating the need to manually cut videos into short clips.
7 | 3. We optimized the network structure to extract features better. Our idea is not to train the discriminator separately, but to train the generator directly.
8 | 4. We trained the base model on a high-definition dataset of hundreds of speakers. Although its generalization ability is not strong, results are very good after single-person or multi-person fine-tuning.
9 |
10 | # wav2lip-576x576 Project overview
11 |
12 | Video | Project Page | Code
13 |
24 | # wav2lip-576x576 Code Release Plan
25 | This project is not yet mature enough.
26 | We will release the code gradually: first the data-processing code, then the inference code, and, when the time is ripe, the training code.
27 |
28 | # Acknowledgements
29 | The code is mainly borrowed from wav2lip, wav2lip-288, wav2lip-384, ER-NeRF, etc.
30 | Thanks to the authors for their wonderful work.
31 |
32 | # Author
33 | Project by Lu Rui of Langzizhixin Technology, Chengdu, China, 2024.
34 |
35 | # Code contributions
36 | At present, the video preprocessing, facial cropping, and audio HuBERT processing code is complete; example invocations are shown below. Contributions of code for the network structure, training, and inference are welcome.
37 |
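38 | # Example usage
39 | The completed data-processing scripts can be run as follows. This is a minimal sketch with hypothetical paths; video-preprocess.py finds videos via the pattern `<data_root>/*/*.mp4` and needs the s3fd model at face_detection/detection/sfd/s3fd.pth.
40 |
41 | ```python
42 | # Mirrors the ffmpeg/subprocess pattern used inside video-preprocess.py itself.
43 | import subprocess
44 |
45 | # Crop faces from every video found under video-data/<speaker>/*.mp4.
46 | subprocess.call(
47 |     "python video-preprocess.py --data_root video-data "
48 |     "--preprocessed_root preprocessed-video-data --ngpu 1 --batch_size 32",
49 |     shell=True)
50 |
51 | # Extract HuBERT features for one 16 kHz wav; the output is saved next to it as <name>_hu.npy.
52 | subprocess.call("python audio-preprocess.py --wav audio-data/example.wav", shell=True)
53 | ```
54 |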
--------------------------------------------------------------------------------
/video-preprocess.py:
--------------------------------------------------------------------------------
1 | import sys
2 |
3 | if sys.version_info < (3, 2):
4 |     raise Exception("Must be using >= Python 3.2")
5 |
6 | from os import listdir, path
7 |
8 | if not path.isfile('face_detection/detection/sfd/s3fd.pth'):
9 | raise FileNotFoundError('Save the s3fd model to face_detection/detection/sfd/s3fd.pth \
10 | before running this script!')
11 |
12 | import multiprocessing as mp
13 | from concurrent.futures import ThreadPoolExecutor, as_completed
14 | import numpy as np
15 | import argparse, os, cv2, traceback, subprocess
16 | from tqdm import tqdm
17 | from glob import glob
18 | import audio
19 | from hparams import hparams as hp
20 |
21 | import face_detection
22 |
23 | parser = argparse.ArgumentParser()
24 |
25 | parser.add_argument('--ngpu', help='Number of GPUs across which to run in parallel', default=1, type=int)
26 | parser.add_argument('--batch_size', help='Single GPU Face detection batch size', default=32, type=int)
27 | parser.add_argument("--data_root", help="Root folder of the LRS2 dataset", required=True)
28 | parser.add_argument("--preprocessed_root", help="Root folder of the preprocessed dataset", required=True)
29 |
30 | args = parser.parse_args()
31 |
32 | fa = [face_detection.FaceAlignment(face_detection.LandmarksType._2D, flip_input=False,
33 | device='cuda:{}'.format(id)) for id in range(args.ngpu)]
34 |
35 | template = 'ffmpeg -loglevel panic -y -i {} -strict -2 {}'
36 | # template2 = 'ffmpeg -hide_banner -loglevel panic -threads 1 -y -i {} -async 1 -ac 1 -vn -acodec pcm_s16le -ar 16000 {}'
37 |
38 | def process_video_file(vfile, args, gpu_id):
39 | video_stream = cv2.VideoCapture(vfile)
40 |
41 | frames = []
42 | while 1:
43 | still_reading, frame = video_stream.read()
44 | if not still_reading:
45 | video_stream.release()
46 | break
47 | frames.append(frame)
48 |
49 | vidname = os.path.basename(vfile).split('.')[0]
50 | dirname = vfile.split('/')[-2]
51 |
52 | fulldir = path.join(args.preprocessed_root, dirname, vidname)
53 | os.makedirs(fulldir, exist_ok=True)
54 |
55 | batches = [frames[i:i + args.batch_size] for i in range(0, len(frames), args.batch_size)]
56 |
57 | i = -1
58 | for fb in batches:
59 | preds = fa[gpu_id].get_detections_for_batch(np.asarray(fb))
60 |
61 | for j, f in enumerate(preds):
62 | i += 1
63 | if f is None:
64 | continue
65 |
66 | x1, y1, x2, y2 = f
67 | cv2.imwrite(path.join(fulldir, '{}.jpg'.format(i)), fb[j][y1:y2, x1:x2])
68 |
69 | def process_audio_file(vfile, args):
70 | vidname = os.path.basename(vfile).split('.')[0]
71 | dirname = vfile.split('/')[-2]
72 |
73 | fulldir = path.join(args.preprocessed_root, dirname, vidname)
74 | os.makedirs(fulldir, exist_ok=True)
75 |
76 | wavpath = path.join(fulldir, 'audio.wav')
77 |
78 | command = template.format(vfile, wavpath)
79 | subprocess.call(command, shell=True)
80 |
81 |
82 | def mp_handler(job):
83 | vfile, args, gpu_id = job
84 | try:
85 | process_video_file(vfile, args, gpu_id)
86 | except KeyboardInterrupt:
87 | exit(0)
88 | except:
89 | traceback.print_exc()
90 |
91 | def main(args):
92 | print('Started processing for {} with {} GPUs'.format(args.data_root, args.ngpu))
93 |
94 | filelist = glob(path.join(args.data_root, '*/*.mp4'))
95 |
96 | jobs = [(vfile, args, i%args.ngpu) for i, vfile in enumerate(filelist)]
97 | p = ThreadPoolExecutor(args.ngpu)
98 | futures = [p.submit(mp_handler, j) for j in jobs]
99 | _ = [r.result() for r in tqdm(as_completed(futures), total=len(futures))]
100 |
101 | print('Dumping audios...')
102 |
103 | for vfile in tqdm(filelist):
104 | try:
105 | process_audio_file(vfile, args)
106 | except KeyboardInterrupt:
107 | exit(0)
108 | except:
109 | traceback.print_exc()
110 | continue
111 |
112 | if __name__ == '__main__':
113 | main(args)
--------------------------------------------------------------------------------
/hparams.py:
--------------------------------------------------------------------------------
1 | from glob import glob
2 | import os
3 |
4 | def get_image_list(file_list_txt):
5 | filelist = []
6 | with open(file_list_txt) as f:
7 | for line in f:
8 | line = line.strip()
9 | if ' ' in line: line = line.split()[0]
10 | filelist.append(line)
11 |
12 | return filelist
13 |
14 | class HParams:
15 | def __init__(self, **kwargs):
16 | self.data = {}
17 |
18 | for key, value in kwargs.items():
19 | self.data[key] = value
20 |
21 | def __getattr__(self, key):
22 | if key not in self.data:
23 | raise AttributeError("'HParams' object has no attribute %s" % key)
24 | return self.data[key]
25 |
26 | def set_hparam(self, key, value):
27 | self.data[key] = value
28 |
29 |
30 | # Default hyperparameters
31 | hparams = HParams(
32 | num_mels=80, # Number of mel-spectrogram channels and local conditioning dimensionality
33 | # network
34 | rescale=True, # Whether to rescale audio prior to preprocessing
35 | rescaling_max=0.9, # Rescaling value
36 |
37 | # Use LWS (https://github.com/Jonathan-LeRoux/lws) for STFT and phase reconstruction
38 | # It's preferred to set True to use with https://github.com/r9y9/wavenet_vocoder
39 | # Does not work if n_fft is not a multiple of hop_size!!
40 | use_lws=False,
41 |
42 | n_fft=800, # Extra window size is filled with 0 paddings to match this parameter
43 | hop_size=200, # For 16000Hz, 200 = 12.5 ms (0.0125 * sample_rate)
44 | win_size=800, # For 16000Hz, 800 = 50 ms (If None, win_size = n_fft) (0.05 * sample_rate)
45 | sample_rate=16000, # 16000Hz (corresponding to librispeech) (sox --i )
46 |
47 | frame_shift_ms=None, # Can replace hop_size parameter. (Recommended: 12.5)
48 |
49 | # Mel and Linear spectrograms normalization/scaling and clipping
50 | signal_normalization=True,
51 | # Whether to normalize mel spectrograms to some predefined range (following below parameters)
52 | allow_clipping_in_normalization=True, # Only relevant if mel_normalization = True
53 | symmetric_mels=True,
54 | # Whether to scale the data to be symmetric around 0. (Also multiplies the output range by 2,
55 | # faster and cleaner convergence)
56 | max_abs_value=4.,
57 | # max absolute value of data. If symmetric, data will be [-max, max] else [0, max] (Must not
58 | # be too big to avoid gradient explosion,
59 | # not too small for fast convergence)
60 | # Contribution by @begeekmyfriend
61 | # Spectrogram Pre-Emphasis (Lfilter: Reduce spectrogram noise and helps model certitude
62 | # levels. Also allows for better G&L phase reconstruction)
63 | preemphasize=True, # whether to apply filter
64 | preemphasis=0.97, # filter coefficient.
65 |
66 | # Limits
67 | min_level_db=-100,
68 | ref_level_db=20,
69 | fmin=55,
70 | # Set this to 55 if your speaker is male! if female, 95 should help taking off noise. (To
71 | # test depending on dataset. Pitch info: male~[65, 260], female~[100, 525])
72 | fmax=7600, # To be increased/reduced depending on data.
73 |
74 | ###################### Our training parameters #################################
75 | img_size=576,
76 | fps=25,
77 |
78 | batch_size=16,
79 | initial_learning_rate=1e-4,
80 | nepochs=200000000000000000, ### ctrl + c, stop whenever eval loss is consistently greater than train loss for ~10 epochs
81 | num_workers=16,
82 | checkpoint_interval=3000,
83 | eval_interval=3000,
84 | save_optimizer_state=True,
85 |
86 | syncnet_wt=0.0, # is initially zero, will be set automatically to 0.03 later. Leads to faster convergence.
87 | syncnet_batch_size=64,
88 | syncnet_lr=1e-4,
89 | syncnet_eval_interval=10000,
90 | syncnet_checkpoint_interval=10000,
91 |
92 | disc_wt=0.07,
93 | disc_initial_learning_rate=1e-4,
94 | )
95 |
96 |
97 | def hparams_debug_string():
98 |     values = hparams.data  # HParams stores all hyperparameters in this dict (it has no values() method)
99 | hp = [" %s: %s" % (name, values[name]) for name in sorted(values) if name != "sentences"]
100 | return "Hyperparameters:\n" + "\n".join(hp)
--------------------------------------------------------------------------------
/audio-preprocess.py:
--------------------------------------------------------------------------------
1 | from transformers import Wav2Vec2Processor, HubertModel
2 | import soundfile as sf
3 | import numpy as np
4 | import torch
5 |
6 | print("Loading the Wav2Vec2 Processor...")
7 | wav2vec2_processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
8 | print("Loading the HuBERT Model...")
9 | hubert_model = HubertModel.from_pretrained("facebook/hubert-large-ls960-ft")
10 |
11 |
12 | def get_hubert_from_16k_wav(wav_16k_name):
13 | speech_16k, _ = sf.read(wav_16k_name)
14 | hubert = get_hubert_from_16k_speech(speech_16k)
15 | return hubert
16 |
17 | @torch.no_grad()
18 | def get_hubert_from_16k_speech(speech, device="cuda:0"):
19 | global hubert_model
20 | hubert_model = hubert_model.to(device)
21 | if speech.ndim ==2:
22 | speech = speech[:, 0] # [T, 2] ==> [T,]
23 | input_values_all = wav2vec2_processor(speech, return_tensors="pt", sampling_rate=16000).input_values # [1, T]
24 | input_values_all = input_values_all.to(device)
25 |     # For long audio sequences we cannot process everything in one pass due to memory limits.
26 |     # HuBERT's CNN front-end has strides [5,2,2,2,2,2,2] (total stride 320) and
27 |     # kernels [10,3,3,3,3,2,2], so 400 samples form the fundamental unit for one time step.
28 |     # The CNN is thus equivalent to a single big Conv1D with kernel k=400 and stride s=320,
29 |     # giving T = floor((t - k) / s) + 1 output steps for t input samples.
30 |     # So that each clip contributes exactly N non-overlapping time steps, we use clips of
31 |     # length (k + s*(N-1)); the next clip starts s*N samples later, i.e. its samples overlap the previous clip by (k - s).
32 | kernel = 400
33 | stride = 320
34 | clip_length = stride * 1000
35 | num_iter = input_values_all.shape[1] // clip_length
36 | expected_T = (input_values_all.shape[1] - (kernel-stride)) // stride
37 | res_lst = []
38 | for i in range(num_iter):
39 | if i == 0:
40 | start_idx = 0
41 | end_idx = clip_length - stride + kernel
42 | else:
43 | start_idx = clip_length * i
44 | end_idx = start_idx + (clip_length - stride + kernel)
45 | input_values = input_values_all[:, start_idx: end_idx]
46 | hidden_states = hubert_model.forward(input_values).last_hidden_state # [B=1, T=pts//320, hid=1024]
47 | res_lst.append(hidden_states[0])
48 | if num_iter > 0:
49 | input_values = input_values_all[:, clip_length * num_iter:]
50 | else:
51 | input_values = input_values_all
52 | # if input_values.shape[1] != 0:
53 | if input_values.shape[1] >= kernel: # if the last batch is shorter than kernel_size, skip it
54 | hidden_states = hubert_model(input_values).last_hidden_state # [B=1, T=pts//320, hid=1024]
55 | res_lst.append(hidden_states[0])
56 | ret = torch.cat(res_lst, dim=0).cpu() # [T, 1024]
57 | # assert ret.shape[0] == expected_T
58 | assert abs(ret.shape[0] - expected_T) <= 1
59 | if ret.shape[0] < expected_T:
60 | ret = torch.nn.functional.pad(ret, (0,0,0,expected_T-ret.shape[0]))
61 | else:
62 | ret = ret[:expected_T]
63 | return ret
64 |
65 | def make_even_first_dim(tensor):
66 | size = list(tensor.size())
67 | if size[0] % 2 == 1:
68 | size[0] -= 1
69 | return tensor[:size[0]]
70 | return tensor
71 |
72 | # Command-line entry point: python audio-preprocess.py --wav path/to/audio.wav
73 | # (soundfile, numpy and torch are already imported at the top of this file.)
74 | from argparse import ArgumentParser
75 | import librosa
76 |
77 |
78 | parser = ArgumentParser()
79 | parser.add_argument('--wav', type=str, help='')
80 | args = parser.parse_args()
81 |
82 | wav_name = args.wav
83 |
84 | speech, sr = sf.read(wav_name)
85 | speech_16k = librosa.resample(speech, orig_sr=sr, target_sr=16000)
86 | print("SR: {} to {}".format(sr, 16000))
87 | # print(speech.shape, speech_16k.shape)
88 |
89 | hubert_hidden = get_hubert_from_16k_speech(speech_16k)
90 | hubert_hidden = make_even_first_dim(hubert_hidden).reshape(-1, 2, 1024)
91 | np.save(wav_name.replace('.wav', '_hu.npy'), hubert_hidden.detach().numpy())
92 | print(hubert_hidden.detach().numpy().shape)
--------------------------------------------------------------------------------