├── audio-data
│   └── README.md
├── video-data
│   └── README.md
├── preprocessed-audio-data
│   └── README.md
├── preprocessed-video-data
│   └── README.md
├── 576x576-CorrespondingVideo.jpg
├── LICENSE
├── README.md
├── video-preprocess.py
├── hparams.py
└── audio-preprocess.py
/audio-data/README.md:
--------------------------------------------------------------------------------
1 | Place audio here, at 16000 Hz.
2 |
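3 | audio-preprocess.py resamples input wavs to 16 kHz itself (via librosa), but if you prefer to store 16 kHz mono files directly, a minimal hypothetical helper could be:
4 |
5 | ```python
6 | import librosa
7 | import soundfile as sf
8 |
9 | def to_16k_wav(src_path, dst_path):
10 |     # Load with resampling to 16 kHz mono, then write it back out as a wav.
11 |     speech, _ = librosa.load(src_path, sr=16000, mono=True)
12 |     sf.write(dst_path, speech, 16000)
13 |
14 | # e.g. to_16k_wav("raw/clip.mp3", "audio-data/clip.wav")  # hypothetical paths
15 | ```
16 |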
--------------------------------------------------------------------------------
/video-data/README.md:
--------------------------------------------------------------------------------
1 | Place the videos here, at 25 fps.
2 |
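3 | video-preprocess.py reads the frames as they are, so videos should already be at 25 fps (hparams.fps). A minimal hypothetical helper, mirroring the ffmpeg/subprocess pattern used in video-preprocess.py (ffmpeg must be on PATH):
4 |
5 | ```python
6 | import subprocess
7 |
8 | def to_25fps(src_path, dst_path):
9 |     # Re-encode the source video at 25 fps.
10 |     command = 'ffmpeg -y -i {} -r 25 {}'.format(src_path, dst_path)
11 |     subprocess.call(command, shell=True)
12 |
13 | # e.g. to_25fps("raw/talk.mov", "video-data/speaker1/talk.mp4")  # hypothetical paths
14 | ```
15 |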
--------------------------------------------------------------------------------
/preprocessed-audio-data/README.md:
--------------------------------------------------------------------------------
1 | HuBERT-processed audio feature files.
2 |
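3 | A quick sanity check (hypothetical file name): audio-preprocess.py saves the HuBERT features next to the wav as `<name>_hu.npy` with shape `[T//2, 2, 1024]`.
4 |
5 | ```python
6 | import numpy as np
7 |
8 | feats = np.load("preprocessed-audio-data/example_hu.npy")  # hypothetical path
9 | print(feats.shape)  # (num_frame_pairs, 2, 1024)
10 | ```
11 |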
--------------------------------------------------------------------------------
/preprocessed-video-data/README.md:
--------------------------------------------------------------------------------
1 | Each video frame is processed and the face is cropped out at a size of 576x576.
2 |
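3 | video-preprocess.py writes the raw face crops; a hypothetical post-processing step to bring each crop to exactly 576x576 (hparams.img_size) could look like:
4 |
5 | ```python
6 | import cv2
7 | from hparams import hparams as hp
8 |
9 | def resize_face(face_bgr):
10 |     # Resize a detected face crop to img_size x img_size (576x576 with the default hparams).
11 |     return cv2.resize(face_bgr, (hp.img_size, hp.img_size))
12 | ```
13 |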
--------------------------------------------------------------------------------
/576x576-CorrespondingVideo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/langzizhixin/wav2lip-576x576/HEAD/576x576-CorrespondingVideo.jpg
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2024 Sam Nguyen
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # wav2lip-576x576 Introduction
2 | This is a talking-face project. We train on 576x576 facial images, which makes it possible to generate 2K, 4K, 6K, and 8K digital-human videos.
3 |
4 | We have optimized in the following areas:
5 | 1. We use HuBERT for audio processing, which gives a significant improvement over wav2lip-96 and wav2lip-288.
6 | 2. We optimized dataset processing, eliminating the need to manually cut videos into short clips.
7 | 3. We optimized the network structure to extract features better. Our idea is not to train the discriminator separately, but to train the generator directly.
8 | 4. We trained the base model on a high-definition dataset of hundreds of speakers. Although its generalization ability is not strong, results are very good after single-person or multi-person fine-tuning.
9 |
10 | # wav2lip-576x576 Project overview
11 |
12 | Video | Project Page | Code
13 |
24 | # wav2lip-576x576 Code Release Plan
25 | This project is not yet mature enough.
26 | We will release the code gradually: first the data-processing code, then the inference code, and, when the time is ripe, the training code.
27 |
28 | # Acknowledgements
29 | The code is mainly borrowed from wav2lip, wav2lip-288, wav2lip-384, ER-NeRF, etc.
30 | Thanks to the authors for their wonderful work.
31 |
32 | # Author
33 | Project by Lu Rui of Langzizhixin Technology, Chengdu, China, 2024.
34 |
35 | # Code contributions
36 | At present, the video preprocessing, facial cropping, and audio HuBERT processing code is complete; example invocations are shown below. Contributions of code for the network structure, training, and inference are welcome.
37 |
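38 | # Example usage
39 | The completed data-processing scripts can be run as follows. This is a minimal sketch with hypothetical paths; video-preprocess.py finds videos via the pattern `<data_root>/*/*.mp4` and needs the s3fd model at face_detection/detection/sfd/s3fd.pth.
40 |
41 | ```python
42 | # Mirrors the ffmpeg/subprocess pattern used inside video-preprocess.py itself.
43 | import subprocess
44 |
45 | # Crop faces from every video found under video-data/<speaker>/*.mp4.
46 | subprocess.call(
47 |     "python video-preprocess.py --data_root video-data "
48 |     "--preprocessed_root preprocessed-video-data --ngpu 1 --batch_size 32",
49 |     shell=True)
50 |
51 | # Extract HuBERT features for one 16 kHz wav; the output is saved next to it as <name>_hu.npy.
52 | subprocess.call("python audio-preprocess.py --wav audio-data/example.wav", shell=True)
53 | ```
54 |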
--------------------------------------------------------------------------------
/video-preprocess.py:
--------------------------------------------------------------------------------
1 | import sys
2 |
3 | if sys.version_info < (3, 2):
4 |     raise Exception("Must be using >= Python 3.2")
5 |
6 | from os import listdir, path
7 |
8 | if not path.isfile('face_detection/detection/sfd/s3fd.pth'):
9 | raise FileNotFoundError('Save the s3fd model to face_detection/detection/sfd/s3fd.pth \
10 | before running this script!')
11 |
12 | import multiprocessing as mp
13 | from concurrent.futures import ThreadPoolExecutor, as_completed
14 | import numpy as np
15 | import argparse, os, cv2, traceback, subprocess
16 | from tqdm import tqdm
17 | from glob import glob
18 | import audio
19 | from hparams import hparams as hp
20 |
21 | import face_detection
22 |
23 | parser = argparse.ArgumentParser()
24 |
25 | parser.add_argument('--ngpu', help='Number of GPUs across which to run in parallel', default=1, type=int)
26 | parser.add_argument('--batch_size', help='Single GPU Face detection batch size', default=32, type=int)
27 | parser.add_argument("--data_root", help="Root folder of the LRS2 dataset", required=True)
28 | parser.add_argument("--preprocessed_root", help="Root folder of the preprocessed dataset", required=True)
29 |
30 | args = parser.parse_args()
31 |
32 | fa = [face_detection.FaceAlignment(face_detection.LandmarksType._2D, flip_input=False,
33 | device='cuda:{}'.format(id)) for id in range(args.ngpu)]
34 |
35 | template = 'ffmpeg -loglevel panic -y -i {} -strict -2 {}'
36 | # template2 = 'ffmpeg -hide_banner -loglevel panic -threads 1 -y -i {} -async 1 -ac 1 -vn -acodec pcm_s16le -ar 16000 {}'
37 |
38 | def process_video_file(vfile, args, gpu_id):
39 | video_stream = cv2.VideoCapture(vfile)
40 |
41 | frames = []
42 | while 1:
43 | still_reading, frame = video_stream.read()
44 | if not still_reading:
45 | video_stream.release()
46 | break
47 | frames.append(frame)
48 |
49 | vidname = os.path.basename(vfile).split('.')[0]
50 | dirname = vfile.split('/')[-2]
51 |
52 | fulldir = path.join(args.preprocessed_root, dirname, vidname)
53 | os.makedirs(fulldir, exist_ok=True)
54 |
55 | batches = [frames[i:i + args.batch_size] for i in range(0, len(frames), args.batch_size)]
56 |
57 | i = -1
58 | for fb in batches:
59 | preds = fa[gpu_id].get_detections_for_batch(np.asarray(fb))
60 |
61 | for j, f in enumerate(preds):
62 | i += 1
63 | if f is None:
64 | continue
65 |
66 | x1, y1, x2, y2 = f
67 | cv2.imwrite(path.join(fulldir, '{}.jpg'.format(i)), fb[j][y1:y2, x1:x2])
68 |
69 | def process_audio_file(vfile, args):
70 | vidname = os.path.basename(vfile).split('.')[0]
71 | dirname = vfile.split('/')[-2]
72 |
73 | fulldir = path.join(args.preprocessed_root, dirname, vidname)
74 | os.makedirs(fulldir, exist_ok=True)
75 |
76 | wavpath = path.join(fulldir, 'audio.wav')
77 |
78 | command = template.format(vfile, wavpath)
79 | subprocess.call(command, shell=True)
80 |
81 |
82 | def mp_handler(job):
83 | vfile, args, gpu_id = job
84 | try:
85 | process_video_file(vfile, args, gpu_id)
86 | except KeyboardInterrupt:
87 | exit(0)
88 | except:
89 | traceback.print_exc()
90 |
91 | def main(args):
92 | print('Started processing for {} with {} GPUs'.format(args.data_root, args.ngpu))
93 |
94 | filelist = glob(path.join(args.data_root, '*/*.mp4'))
95 |
96 | jobs = [(vfile, args, i%args.ngpu) for i, vfile in enumerate(filelist)]
97 | p = ThreadPoolExecutor(args.ngpu)
98 | futures = [p.submit(mp_handler, j) for j in jobs]
99 | _ = [r.result() for r in tqdm(as_completed(futures), total=len(futures))]
100 |
101 | print('Dumping audios...')
102 |
103 | for vfile in tqdm(filelist):
104 | try:
105 | process_audio_file(vfile, args)
106 | except KeyboardInterrupt:
107 | exit(0)
108 | except:
109 | traceback.print_exc()
110 | continue
111 |
112 | if __name__ == '__main__':
113 | main(args)
--------------------------------------------------------------------------------
/hparams.py:
--------------------------------------------------------------------------------
1 | from glob import glob
2 | import os
3 |
4 | def get_image_list(file_list_txt):
5 | filelist = []
6 | with open(file_list_txt) as f:
7 | for line in f:
8 | line = line.strip()
9 | if ' ' in line: line = line.split()[0]
10 | filelist.append(line)
11 |
12 | return filelist
13 |
14 | class HParams:
15 | def __init__(self, **kwargs):
16 | self.data = {}
17 |
18 | for key, value in kwargs.items():
19 | self.data[key] = value
20 |
21 | def __getattr__(self, key):
22 | if key not in self.data:
23 | raise AttributeError("'HParams' object has no attribute %s" % key)
24 | return self.data[key]
25 |
26 | def set_hparam(self, key, value):
27 | self.data[key] = value
28 |
29 |
30 | # Default hyperparameters
31 | hparams = HParams(
32 | num_mels=80, # Number of mel-spectrogram channels and local conditioning dimensionality
33 | # network
34 | rescale=True, # Whether to rescale audio prior to preprocessing
35 | rescaling_max=0.9, # Rescaling value
36 |
37 | # Use LWS (https://github.com/Jonathan-LeRoux/lws) for STFT and phase reconstruction
38 | # It's preferred to set True to use with https://github.com/r9y9/wavenet_vocoder
39 | # Does not work if n_fft is not a multiple of hop_size!!
40 | use_lws=False,
41 |
42 | n_fft=800, # Extra window size is filled with 0 paddings to match this parameter
43 | hop_size=200, # For 16000Hz, 200 = 12.5 ms (0.0125 * sample_rate)
44 | win_size=800, # For 16000Hz, 800 = 50 ms (If None, win_size = n_fft) (0.05 * sample_rate)
45 | sample_rate=16000, # 16000Hz (corresponding to librispeech) (sox --i )
46 |
47 | frame_shift_ms=None, # Can replace hop_size parameter. (Recommended: 12.5)
48 |
49 | # Mel and Linear spectrograms normalization/scaling and clipping
50 | signal_normalization=True,
51 | # Whether to normalize mel spectrograms to some predefined range (following below parameters)
52 | allow_clipping_in_normalization=True, # Only relevant if mel_normalization = True
53 | symmetric_mels=True,
54 | # Whether to scale the data to be symmetric around 0. (Also multiplies the output range by 2,
55 | # faster and cleaner convergence)
56 | max_abs_value=4.,
57 | # max absolute value of data. If symmetric, data will be [-max, max] else [0, max] (Must not
58 | # be too big to avoid gradient explosion,
59 | # not too small for fast convergence)
60 | # Contribution by @begeekmyfriend
61 | # Spectrogram Pre-Emphasis (Lfilter: Reduce spectrogram noise and helps model certitude
62 | # levels. Also allows for better G&L phase reconstruction)
63 | preemphasize=True, # whether to apply filter
64 | preemphasis=0.97, # filter coefficient.
65 |
66 | # Limits
67 | min_level_db=-100,
68 | ref_level_db=20,
69 | fmin=55,
70 | # Set this to 55 if your speaker is male! if female, 95 should help taking off noise. (To
71 | # test depending on dataset. Pitch info: male~[65, 260], female~[100, 525])
72 | fmax=7600, # To be increased/reduced depending on data.
73 |
74 | ###################### Our training parameters #################################
75 | img_size=576,
76 | fps=25,
77 |
78 | batch_size=16,
79 | initial_learning_rate=1e-4,
80 | nepochs=200000000000000000, ### ctrl + c, stop whenever eval loss is consistently greater than train loss for ~10 epochs
81 | num_workers=16,
82 | checkpoint_interval=3000,
83 | eval_interval=3000,
84 | save_optimizer_state=True,
85 |
86 | syncnet_wt=0.0, # is initially zero, will be set automatically to 0.03 later. Leads to faster convergence.
87 | syncnet_batch_size=64,
88 | syncnet_lr=1e-4,
89 | syncnet_eval_interval=10000,
90 | syncnet_checkpoint_interval=10000,
91 |
92 | disc_wt=0.07,
93 | disc_initial_learning_rate=1e-4,
94 | )
95 |
96 |
97 | def hparams_debug_string():
98 |     values = hparams.data  # HParams stores all hyperparameters in this dict (it has no values() method)
99 | hp = [" %s: %s" % (name, values[name]) for name in sorted(values) if name != "sentences"]
100 | return "Hyperparameters:\n" + "\n".join(hp)
--------------------------------------------------------------------------------
/audio-preprocess.py:
--------------------------------------------------------------------------------
1 | from transformers import Wav2Vec2Processor, HubertModel
2 | import soundfile as sf
3 | import numpy as np
4 | import torch
5 |
6 | print("Loading the Wav2Vec2 Processor...")
7 | wav2vec2_processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
8 | print("Loading the HuBERT Model...")
9 | hubert_model = HubertModel.from_pretrained("facebook/hubert-large-ls960-ft")
10 |
11 |
12 | def get_hubert_from_16k_wav(wav_16k_name):
13 | speech_16k, _ = sf.read(wav_16k_name)
14 | hubert = get_hubert_from_16k_speech(speech_16k)
15 | return hubert
16 |
17 | @torch.no_grad()
18 | def get_hubert_from_16k_speech(speech, device="cuda:0"):
19 | global hubert_model
20 | hubert_model = hubert_model.to(device)
21 | if speech.ndim ==2:
22 | speech = speech[:, 0] # [T, 2] ==> [T,]
23 | input_values_all = wav2vec2_processor(speech, return_tensors="pt", sampling_rate=16000).input_values # [1, T]
24 | input_values_all = input_values_all.to(device)
25 |     # For long audio sequences we cannot process everything in one pass due to memory limits.
26 |     # HuBERT's CNN front-end has strides [5,2,2,2,2,2,2] (total stride 320) and
27 |     # kernels [10,3,3,3,3,2,2], so 400 samples form the fundamental unit for one time step.
28 |     # The CNN is thus equivalent to a single big Conv1D with kernel k=400 and stride s=320,
29 |     # giving T = floor((t - k) / s) + 1 output steps for t input samples.
30 |     # So that each clip contributes exactly N non-overlapping time steps, we use clips of
31 |     # length (k + s*(N-1)); the next clip starts s*N samples later, i.e. its samples overlap the previous clip by (k - s).
32 | kernel = 400
33 | stride = 320
34 | clip_length = stride * 1000
35 | num_iter = input_values_all.shape[1] // clip_length
36 | expected_T = (input_values_all.shape[1] - (kernel-stride)) // stride
37 | res_lst = []
38 | for i in range(num_iter):
39 | if i == 0:
40 | start_idx = 0
41 | end_idx = clip_length - stride + kernel
42 | else:
43 | start_idx = clip_length * i
44 | end_idx = start_idx + (clip_length - stride + kernel)
45 | input_values = input_values_all[:, start_idx: end_idx]
46 | hidden_states = hubert_model.forward(input_values).last_hidden_state # [B=1, T=pts//320, hid=1024]
47 | res_lst.append(hidden_states[0])
48 | if num_iter > 0:
49 | input_values = input_values_all[:, clip_length * num_iter:]
50 | else:
51 | input_values = input_values_all
52 | # if input_values.shape[1] != 0:
53 | if input_values.shape[1] >= kernel: # if the last batch is shorter than kernel_size, skip it
54 | hidden_states = hubert_model(input_values).last_hidden_state # [B=1, T=pts//320, hid=1024]
55 | res_lst.append(hidden_states[0])
56 | ret = torch.cat(res_lst, dim=0).cpu() # [T, 1024]
57 | # assert ret.shape[0] == expected_T
58 | assert abs(ret.shape[0] - expected_T) <= 1
59 | if ret.shape[0] < expected_T:
60 | ret = torch.nn.functional.pad(ret, (0,0,0,expected_T-ret.shape[0]))
61 | else:
62 | ret = ret[:expected_T]
63 | return ret
64 |
65 | def make_even_first_dim(tensor):
66 | size = list(tensor.size())
67 | if size[0] % 2 == 1:
68 | size[0] -= 1
69 | return tensor[:size[0]]
70 | return tensor
71 |
72 | # Command-line entry point: python audio-preprocess.py --wav path/to/audio.wav
73 | # (soundfile, numpy and torch are already imported at the top of this file.)
74 | from argparse import ArgumentParser
75 | import librosa
76 |
77 |
78 | parser = ArgumentParser()
79 | parser.add_argument('--wav', type=str, help='')
80 | args = parser.parse_args()
81 |
82 | wav_name = args.wav
83 |
84 | speech, sr = sf.read(wav_name)
85 | speech_16k = librosa.resample(speech, orig_sr=sr, target_sr=16000)
86 | print("SR: {} to {}".format(sr, 16000))
87 | # print(speech.shape, speech_16k.shape)
88 |
89 | hubert_hidden = get_hubert_from_16k_speech(speech_16k)
90 | hubert_hidden = make_even_first_dim(hubert_hidden).reshape(-1, 2, 1024)
91 | np.save(wav_name.replace('.wav', '_hu.npy'), hubert_hidden.detach().numpy())
92 | print(hubert_hidden.detach().numpy().shape)
--------------------------------------------------------------------------------