├── README.md
├── figure.png
├── transforms.py
└── wildvsr_test.py

/README.md:
--------------------------------------------------------------------------------

# **Do VSR Models Generalize Beyond LRS3?**

This repository contains our benchmark **WildVSR**, a new test set for Visual Speech Recognition in English. For details, refer to the paper
[Do VSR Models Generalize Beyond LRS3?](https://arxiv.org/abs/2311.14063).


## Dataset Summary

The Lip Reading Sentences-3 (LRS3) benchmark has been the primary focus of intense research in visual speech recognition (VSR) during the last few years. As a result, there is an increased risk of overfitting to its heavily used test set, which amounts to only one hour of video. To alleviate this issue, we build **WildVSR**, a new VSR test set that closely follows the LRS3 dataset creation process. We then evaluate and analyse the extent to which current VSR models generalise to the new test data. We evaluate a broad range of publicly available VSR models and find significant drops in performance on our test set compared to their corresponding LRS3 results.

Comparison of statistics between LRS3 and WildVSR:

![Lip2Vec Illustration](figure.png)


## Downloading the Data

The data can be found at this [link](https://drive.google.com/file/d/1EUx-KffQSLQE5uc5MZaeHKQEZNeHdBwP/view?usp=drive_link). The test set is structured as follows:
```
WildVSR/
├── videos/
│   ├── 00001.mp4
│   └── 00002.mp4
└── labels.json
```
The ```labels.json``` file uses the ```'video_ID': 'label'``` format, where each ```video_ID``` corresponds to a file name in the ```videos``` folder.

You can use ```wildvsr_test.py``` to load the data; note that all clips are cropped and transformed on loading:

```bash
python wildvsr_test.py --wildvsr_path=[path_to_data]
```

## Intended Use

This dataset can be used to test visual speech recognition models for English. It is particularly useful for research and development in the field of audio-visual content processing, and can be used to assess the performance of current and future models.

## Limitations and Biases
Because the data collection process focuses on YouTube, biases inherent to the platform may be present in the dataset. Moreover, while measures were taken to ensure diversity in content, the dataset might still be skewed towards certain types of content due to the filtering process.

## Ethical Considerations
The dataset only uses free-to-use content, complying with legal requirements and respecting the original content creators. However, users of the dataset should keep in mind the potential biases and limitations inherent in the dataset.
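
## Example Evaluation

A minimal evaluation sketch using the loader from ```wildvsr_test.py```. Here, ```model``` is a placeholder for any VSR model that maps a video tensor to a text hypothesis, and the word error rate is computed with the third-party ```jiwer``` package; neither is part of this repository:

```python
import jiwer  # third-party WER package: pip install jiwer

from wildvsr_test import WildVSR

dataset = WildVSR(data_dir='path/to/WildVSR')
references, hypotheses = [], []
for idx in range(len(dataset)):
    clip, label = dataset[idx]        # clip: (frames, 88, 88) FloatTensor
    references.append(label)
    # hypotheses.append(model(clip))  # plug in your VSR model here

# print(f'WER: {jiwer.wer(references, hypotheses):.3f}')
```
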
## Citation
```bibtex
@article{djilali2023vsr,
  title={Do VSR Models Generalize Beyond LRS3?},
  author={Djilali, Yasser Abdelaziz Dahou and Narayan, Sanath and Bihan, Eustache Le and Boussaid, Haithem and Almazrouei, Ebtessam and Debbah, Merouane},
  journal={arXiv preprint arXiv:2311.14063},
  year={2023}
}
```
--------------------------------------------------------------------------------
/figure.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YasserdahouML/VSR_test_set/ff30ed9795c71e4c67fe7aa5de8b69bc7912c758/figure.png
--------------------------------------------------------------------------------
/transforms.py:
--------------------------------------------------------------------------------
__all__ = ["Compose", "Normalize", "CenterCrop"]


class Compose(object):
    """Compose several preprocessing transforms together.

    Args:
        preprocess (list of ``Preprocess`` objects): list of transforms to compose.
    """

    def __init__(self, preprocess):
        self.preprocess = preprocess

    def __call__(self, img):
        # Apply each transform in sequence.
        for t in self.preprocess:
            img = t(img)
        return img

    def __repr__(self):
        format_string = self.__class__.__name__ + '('
        for t in self.preprocess:
            format_string += '\n'
            format_string += '    {0}'.format(t)
        format_string += '\n)'
        return format_string


class Normalize(object):
    """Normalize an image (ndarray or Tensor) with mean and standard deviation."""

    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def __call__(self, img):
        """
        Args:
            img (Tensor): Tensor image of size (T, H, W) to be normalized.
        Returns:
            Tensor: Normalized Tensor image.
        """
        img = (img - self.mean) / self.std
        return img

    def __repr__(self):
        return self.__class__.__name__ + '(mean={0}, std={1})'.format(self.mean, self.std)


class CenterCrop(object):
    """Crop the given image at the center."""

    def __init__(self, crop_size):
        self.crop_size = crop_size

    def __call__(self, img):
        """
        Args:
            img (numpy.ndarray): Images to be cropped.
        Returns:
            numpy.ndarray: Cropped images.
        """
        frames, h, w = img.shape
        th, tw = self.crop_size
        delta_w = int(round((w - tw) / 2.0))
        delta_h = int(round((h - th) / 2.0))
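        # For example, with 96x96 input frames and an 88x88 crop,
        # delta_h = delta_w = 4, so the slice below keeps the central
        # 88x88 window (rows 4..91, cols 4..91) of every frame.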
        return img[:, delta_h:delta_h + th, delta_w:delta_w + tw]

    def __repr__(self):
        return self.__class__.__name__ + '(size={0})'.format(self.crop_size)
--------------------------------------------------------------------------------
/wildvsr_test.py:
--------------------------------------------------------------------------------
from pathlib import Path
import torch
import torch.utils.data
import os
import glob
import numpy as np
import cv2
from transforms import Compose, Normalize, CenterCrop
import json, argparse


class WildVSR(object):
    def __init__(self, data_dir, convert_gray=True):
        self._data_dir = data_dir
        self.fps = 25
        self._convert_gray = convert_gray

        self.vid_transform = self.get_video_transform()
        self._data_files = []

        search_str_original = os.path.join(self._data_dir, 'videos', '*.mp4')
        self._data_files.extend(glob.glob(search_str_original))

        labels_path = os.path.join(self._data_dir, 'labels.json')
        with open(labels_path, 'r') as file:
            self.labels = json.load(file)

        print(f'Found {len(self._data_files)} video files')

    def get_video_transform(self):
        """Return the video transform pipeline: scale pixel values to [0, 1],
        center-crop each frame to 88x88, then normalize with the dataset
        mean and standard deviation."""
        crop_size = (88, 88)
        (mean, std) = (0.421, 0.165)

        return Compose([
            Normalize(0.0, 255.0),  # map raw pixel values to [0, 1]
            CenterCrop(crop_size),
            Normalize(mean, std),
        ])

    def read_video_as_np_array(self, video_path):
        # Open the video file
        cap = cv2.VideoCapture(video_path)
        if not cap.isOpened():
            raise IOError(f"Cannot open video file {video_path}")

        # Read the video frame by frame
        frames = []
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            if self._convert_gray:
                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            frames.append(frame)

        # Close the video file
        cap.release()

        # Stack the list of frames into a (frames, H, W) numpy array
        return np.array(frames)

    def __getitem__(self, idx):
        path_video = self._data_files[idx]
        video_ID = os.path.basename(path_video)
        raw_data = self.read_video_as_np_array(path_video)
        preprocess_data = torch.FloatTensor(raw_data)
        preprocess_data = self.vid_transform(preprocess_data)

        label = self.labels[video_ID]
        return preprocess_data, label

    def __len__(self):
        return len(self._data_files)


def build(args):
    root = Path(args.wildvsr_path)
    assert root.exists(), f'provided WildVSR path {root} does not exist'
    return WildVSR(data_dir=root)


def get_args_parser():
    parser = argparse.ArgumentParser('Loading the WildVSR', add_help=False)
    parser.add_argument('--dataset_file', default='wildvsr')
    parser.add_argument('--wildvsr_path', default='', type=str)
    return parser


def main(args):
    wildVSR_test_set = build(args=args)
    # Define model
    # model.eval()
    for idx in range(len(wildVSR_test_set)):
        clip, label = wildVSR_test_set[idx]
        # Process the clip and label as needed
        # prediction = model(clip)
        # wer_score = wer(label, prediction)
        print(f"Ground-truth: {label}")  # For demonstration purposes
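
# A hypothetical batched-loading sketch (not part of the original script): the
# WildVSR class implements __getitem__/__len__, so it works as a map-style
# dataset with a standard PyTorch DataLoader. batch_size=1 sidesteps padding
# of variable-length clips; num_workers parallelizes video decoding.
def iterate_with_dataloader(dataset, num_workers=2):
    loader = torch.utils.data.DataLoader(dataset, batch_size=1, num_workers=num_workers)
    for clip, label in loader:
        # clip: (1, frames, 88, 88) tensor; label: list with a single string
        yield clip, label[0]
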

if __name__ == '__main__':
    parser = argparse.ArgumentParser('Loading the WildVSR', parents=[get_args_parser()])
    args = parser.parse_args()
    main(args)
--------------------------------------------------------------------------------