├── README.md
├── figure.png
├── transforms.py
└── wildvsr_test.py

/README.md:
--------------------------------------------------------------------------------

# **Do VSR Models Generalize Beyond LRS3?**

This repository contains our benchmark **WildVSR**, a new test set for Visual Speech Recognition in English. For details, refer to the paper
[Do VSR Models Generalize Beyond LRS3?](https://arxiv.org/abs/2311.14063).


## Dataset Summary

The Lip Reading Sentences-3 (LRS3) benchmark has been the primary focus of intense research in visual speech recognition (VSR) during the last few years. As a result, there is an increased risk of overfitting to its heavily used test set, which amounts to only one hour of video. To alleviate this issue, we build **WildVSR**, a new VSR test set that closely follows the LRS3 dataset creation process. We then evaluate and analyse the extent to which current VSR models generalise to the new test data. We evaluate a broad range of publicly available VSR models and find significant drops in performance on our test set compared to their corresponding LRS3 results.

Comparison of statistics between LRS3 and WildVSR:

![Lip2Vec Illustration](figure.png)


## Downloading the Data

The data can be found at this [link](https://drive.google.com/file/d/1EUx-KffQSLQE5uc5MZaeHKQEZNeHdBwP/view?usp=drive_link). The test set is structured as follows:
```
WildVSR/
├── videos/
│   ├── 00001.mp4
│   └── 00002.mp4
└── labels.json
```
The ```labels.json``` file uses the ```'video_ID': 'label'``` format, where each ```video_ID``` corresponds to a file name in the ```videos``` folder.

You can use ```wildvsr_test.py``` to load the data; note that all clips are cropped and transformed on loading:

```bash
python wildvsr_test.py --wildvsr_path=[path_to_data]
```

## Intended Use

This dataset can be used to test visual speech recognition models for English. It is particularly useful for research and development in the field of audio-visual content processing, and can be used to assess the performance of current and future models.

## Limitations and Biases
Because the data collection process focuses on YouTube, biases inherent to the platform may be present in the dataset. Moreover, while measures were taken to ensure diversity in content, the dataset might still be skewed towards certain types of content due to the filtering process.

## Ethical Considerations
The dataset only uses free-to-use content, complying with legal requirements and respecting the original content creators. However, users of the dataset should keep in mind the potential biases and limitations inherent in the dataset.
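
## Example Evaluation

A minimal evaluation sketch using the loader from ```wildvsr_test.py```. Here, ```model``` is a placeholder for any VSR model that maps a video tensor to a text hypothesis, and the word error rate is computed with the third-party ```jiwer``` package; neither is part of this repository:

```python
import jiwer  # third-party WER package: pip install jiwer

from wildvsr_test import WildVSR

dataset = WildVSR(data_dir='path/to/WildVSR')
references, hypotheses = [], []
for idx in range(len(dataset)):
    clip, label = dataset[idx]        # clip: (frames, 88, 88) FloatTensor
    references.append(label)
    # hypotheses.append(model(clip))  # plug in your VSR model here

# print(f'WER: {jiwer.wer(references, hypotheses):.3f}')
```
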
## Citation
```bibtex
@article{djilali2023vsr,
  title={Do VSR Models Generalize Beyond LRS3?},
  author={Djilali, Yasser Abdelaziz Dahou and Narayan, Sanath and Bihan, Eustache Le and Boussaid, Haithem and Almazrouei, Ebtessam and Debbah, Merouane},
  journal={arXiv preprint arXiv:2311.14063},
  year={2023}
}
```
--------------------------------------------------------------------------------
/figure.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YasserdahouML/VSR_test_set/ff30ed9795c71e4c67fe7aa5de8b69bc7912c758/figure.png
--------------------------------------------------------------------------------
/transforms.py:
--------------------------------------------------------------------------------
__all__ = ["Compose", "Normalize", "CenterCrop"]


class Compose(object):
    """Compose several preprocessing transforms together.

    Args:
        preprocess (list of ``Preprocess`` objects): list of transforms to compose.
    """

    def __init__(self, preprocess):
        self.preprocess = preprocess

    def __call__(self, img):
        # Apply each transform in sequence.
        for t in self.preprocess:
            img = t(img)
        return img

    def __repr__(self):
        format_string = self.__class__.__name__ + '('
        for t in self.preprocess:
            format_string += '\n'
            format_string += '    {0}'.format(t)
        format_string += '\n)'
        return format_string


class Normalize(object):
    """Normalize an image (ndarray or Tensor) with mean and standard deviation."""

    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def __call__(self, img):
        """
        Args:
            img (Tensor): Tensor image of size (T, H, W) to be normalized.
        Returns:
            Tensor: Normalized Tensor image.
        """
        img = (img - self.mean) / self.std
        return img

    def __repr__(self):
        return self.__class__.__name__ + '(mean={0}, std={1})'.format(self.mean, self.std)


class CenterCrop(object):
    """Crop the given image at the center."""

    def __init__(self, crop_size):
        self.crop_size = crop_size

    def __call__(self, img):
        """
        Args:
            img (numpy.ndarray): Images to be cropped.
        Returns:
            numpy.ndarray: Cropped images.
        """
        frames, h, w = img.shape
        th, tw = self.crop_size
        delta_w = int(round((w - tw) / 2.0))
        delta_h = int(round((h - th) / 2.0))
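        # For example, with 96x96 input frames and an 88x88 crop,
        # delta_h = delta_w = 4, so the slice below keeps the central
        # 88x88 window (rows 4..91, cols 4..91) of every frame.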
        return img[:, delta_h:delta_h + th, delta_w:delta_w + tw]

    def __repr__(self):
        return self.__class__.__name__ + '(size={0})'.format(self.crop_size)
--------------------------------------------------------------------------------
/wildvsr_test.py:
--------------------------------------------------------------------------------
from pathlib import Path
import torch
import torch.utils.data
import os
import glob
import numpy as np
import cv2
from transforms import Compose, Normalize, CenterCrop
import json, argparse


class WildVSR(object):
    def __init__(self, data_dir, convert_gray=True):
        self._data_dir = data_dir
        self.fps = 25
        self._convert_gray = convert_gray

        self.vid_transform = self.get_video_transform()
        self._data_files = []

        search_str_original = os.path.join(self._data_dir, 'videos', '*.mp4')
        self._data_files.extend(glob.glob(search_str_original))

        labels_path = os.path.join(self._data_dir, 'labels.json')
        with open(labels_path, 'r') as file:
            self.labels = json.load(file)

        print(f'Found {len(self._data_files)} video files')

    def get_video_transform(self):
        """Return the video transform pipeline: scale pixel values to [0, 1],
        center-crop each frame to 88x88, then normalize with the dataset
        mean and standard deviation."""
        crop_size = (88, 88)
        (mean, std) = (0.421, 0.165)

        return Compose([
            Normalize(0.0, 255.0),  # map raw pixel values to [0, 1]
            CenterCrop(crop_size),
            Normalize(mean, std),
        ])

    def read_video_as_np_array(self, video_path):
        # Open the video file
        cap = cv2.VideoCapture(video_path)
        if not cap.isOpened():
            raise IOError(f"Cannot open video file {video_path}")

        # Read the video frame by frame
        frames = []
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            if self._convert_gray:
                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            frames.append(frame)

        # Close the video file
        cap.release()

        # Stack the list of frames into a (frames, H, W) numpy array
        return np.array(frames)

    def __getitem__(self, idx):
        path_video = self._data_files[idx]
        video_ID = os.path.basename(path_video)
        raw_data = self.read_video_as_np_array(path_video)
        preprocess_data = torch.FloatTensor(raw_data)
        preprocess_data = self.vid_transform(preprocess_data)

        label = self.labels[video_ID]
        return preprocess_data, label

    def __len__(self):
        return len(self._data_files)


def build(args):
    root = Path(args.wildvsr_path)
    assert root.exists(), f'provided WildVSR path {root} does not exist'
    return WildVSR(data_dir=root)


def get_args_parser():
    parser = argparse.ArgumentParser('Loading the WildVSR', add_help=False)
    parser.add_argument('--dataset_file', default='wildvsr')
    parser.add_argument('--wildvsr_path', default='', type=str)
    return parser


def main(args):
    wildVSR_test_set = build(args=args)
    # Define model
    # model.eval()
    for idx in range(len(wildVSR_test_set)):
        clip, label = wildVSR_test_set[idx]
        # Process the clip and label as needed
        # prediction = model(clip)
        # wer_score = wer(label, prediction)
        print(f"Ground-truth: {label}")  # For demonstration purposes
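
# A hypothetical batched-loading sketch (not part of the original script): the
# WildVSR class implements __getitem__/__len__, so it works as a map-style
# dataset with a standard PyTorch DataLoader. batch_size=1 sidesteps padding
# of variable-length clips; num_workers parallelizes video decoding.
def iterate_with_dataloader(dataset, num_workers=2):
    loader = torch.utils.data.DataLoader(dataset, batch_size=1, num_workers=num_workers)
    for clip, label in loader:
        # clip: (1, frames, 88, 88) tensor; label: list with a single string
        yield clip, label[0]
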

if __name__ == '__main__':
    parser = argparse.ArgumentParser('Loading the WildVSR', parents=[get_args_parser()])
    args = parser.parse_args()
    main(args)
--------------------------------------------------------------------------------