├── .gitignore
├── LICENSE
├── README.md
├── YouTubeFacesDB
│   ├── Dataset.py
│   ├── Generator.py
│   └── __init__.py
├── docs
│   ├── Makefile
│   └── source
│       ├── api.rst
│       ├── conf.py
│       ├── index.rst
│       └── tutorial.rst
├── examples
│   ├── GenerateSubset.py
│   ├── TrainKeras-Generator.py
│   └── TrainKeras.py
└── setup.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# ---> Python
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover

# Translations
*.mo
*.pot

# Django stuff:
*.log

# Sphinx documentation
docs/_build/

# PyBuilder
target/

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License
Copyright (c)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# YouTubeFacesDB

Python module for loading the YouTube Faces Database:

<http://www.cs.tau.ac.il/~wolf/ytfaces/>

**Description:** The data set contains 3,425 videos of 1,595 different people. All the videos were downloaded from YouTube. An average of 2.15 videos are available for each subject. The shortest clip duration is 48 frames, the longest clip is 6,070 frames, and the average length of a video clip is 181.3 frames.

**For TUC users:** the DB is already downloaded on cortex at `/work/biblio/youtube Faces DB` (with the spaces). Copy it to your machine (in `/scratch`, as it is over 25GB) and uncompress it.

**Author:** Julien Vitay <julien.vitay@informatik.tu-chemnitz.de>

**License:** MIT

## Installation

Apart from the usual Python (2.7) + numpy dependencies, the module requires:

* **Pillow** `pip install Pillow --user` for image processing.
* **h5py** `pip install h5py --user` to manage the HDF5 files. `libhdf5` should also be installed through your package manager.

The module can then be installed locally with:

~~~bash
python setup.py install --user
~~~

To build the documentation, you will need Sphinx `pip install Sphinx --user`. You can then go into the `docs/` directory and build it with:

~~~bash
make html
~~~

You can then access `docs/build/html/index.html` with your browser.

## Tutorial

### Transforming the YouTube Faces Database into an HDF5 file

An example is provided in `examples/GenerateSubset.py`. It accesses the dataset located at `/scratch/vitay/Datasets/YouTubeFaces` (`directory`), selects 10 random labels from it (`labels`), fetches all corresponding images (`max_number`), crops them so that they contain only the face area (`cropped`), converts them to grayscale (`color`), resizes them to (100, 100) (`size`), prepends a dummy dimension to obtain a final numpy array of shape (1, 100, 100) (`bw_first`) and dumps them to the HDF5 file `ytfdb.h5` (`filename`).

~~~python
from YouTubeFacesDB import generate_ytf_database
generate_ytf_database(
    directory='/scratch/vitay/Datasets/YouTubeFaces', # Location of the YTF dataset
    filename='ytfdb.h5', # Name of the HDF5 file to write to
    labels=10, # Number of labels to randomly select
    max_number=-1, # Maximum number of images to use
    size=(100, 100), # Size of the images
    color=False, # Black and white
    bw_first=True, # Final shape is (1, w, h)
    cropped=True # The original images are cropped to the faces
)
~~~

Check the doc of `generate_ytf_database` to see the other arguments to this function.

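For instance, here is a sketch that keeps the color channels and limits the number of images per person, using the documented `max_images_per_person` and `rgb_first` arguments (the label list and the output file name below are only placeholders):

~~~python
from YouTubeFacesDB import generate_ytf_database
generate_ytf_database(
    directory='/scratch/vitay/Datasets/YouTubeFaces',
    filename='ytfdb-color.h5',     # hypothetical output file
    labels=['Aaron_Eckhart'],      # an explicit list of labels instead of a count
    max_images_per_person=500,     # at most 500 images per person
    size=(100, 100),               # Size of the images
    color=True,                    # keep the RGB channels
    rgb_first=True,                # final shape is (3, w, h), e.g. for Theano
    cropped=True                   # crop to the face region
)
~~~
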
**Beware:** if you try to generate all color images of all labels with a size (100, 100), the process will take over half an hour and the HDF5 file will be over 50GB, so do not save it in your home directory.

### Loading the HDF5 file for usage in Python

Once the HDF5 file has been generated, you can use it in Python for learning. An example is provided in `examples/TrainKeras.py`, where a convolutional network written in Keras (`pip install Theano --user && pip install keras --user`) is trained on the data contained in `ytfdb.h5`.

#### Loading the dataset into memory

To load the data, you need to create a `YouTubeFacesDB` object, pass it the path to the HDF5 file and call the `get()` method:

~~~python
from YouTubeFacesDB import YouTubeFacesDB
db = YouTubeFacesDB('ytfdb.h5')
X, y = db.get()
~~~

`X` is a numpy array containing all input images. The first index corresponds to the image number, the remaining ones to the shape of the numpy array representing each image. This information can also be retrieved through the attributes of the object:

~~~python
N = db.nb_samples # number of samples, e.g. 10000
d = db.input_dim # shape of the images, e.g. (1, 100, 100)
~~~

`y` is a numpy array containing the label index for each image (in vectorized form, see *categorical outputs*). You can access the number of labels, as well as the list of labels, easily:

~~~python
C = db.nb_classes # Number of classes
labels = db.labels # List of strings for the labels
~~~

#### Transforming the data

**Mean removal**

`X` contains for each pixel a floating-point value between 0. and 1. (the conversion from integers [0..255] to floats [0..1] was done during the generation process). However, neural networks typically work much better when the input data has a zero mean. Fortunately, the mean input (i.e. the mean face) was also saved during the generation process. You can remove it from the input using:

~~~python
mean_face = db.mean
X -= mean_face
~~~

You can also tell the `YouTubeFacesDB` object to systematically remove this mean from the inputs:

~~~python
db = YouTubeFacesDB('ytfdb.h5', mean_removal=True)
X, y = db.get()
~~~

This way, `X` has a zero mean over the first axis, without needing to explicitly compute it. This is particularly useful when generating minibatches.

**Categorical outputs**

The output labels are originally integers between 0 and `db.nb_classes` - 1. To train neural networks, it is often required to represent the output as binary arrays of length `db.nb_classes`, where only one element is 1 and the rest are 0. For example, the third class among 10 would be represented by `0010000000`. This is the default representation returned by the `YouTubeFacesDB` object.

If you prefer to get the labels as integers in `y`, you can specify it in the constructor:

~~~python
db = YouTubeFacesDB('ytfdb.h5', output_type='integer')
~~~

The default value of `output_type` is `vector`.

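Since the default outputs are one-hot encoded, mapping a sample (or a network prediction) back to the person's name is a simple `argmax` over the vector, followed by a lookup in `db.labels`. A minimal sketch, reusing the `y` array obtained from `db.get()` above:

~~~python
import numpy as np
class_index = np.argmax(y[0]) # index of the active class for the first sample
name = db.labels[class_index] # corresponding label string
print(name)
~~~
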
#### Splitting the data into training, validation and test sets

`db.get()` returns by default the whole data. If you want to split this data into training, validation and test sets, you can call the method `split_dataset()`:

~~~python
db.split_dataset(validation_size=0.2, test_size=0.1)
~~~

In this example, the validation set will contain 20% of the samples and the test set 10%. The rest stays in the training set. The samples are randomly chosen from the data. To retrieve the corresponding data, provide an argument to `get()`:

~~~python
db.split_dataset(validation_size=0.2, test_size=0.1)
X_train, y_train = db.get('train')
X_val, y_val = db.get('val')
X_test, y_test = db.get('test')
~~~

By default, the validation set has 20% of the data and the test set 0%.

#### Generating minibatches

Loading the whole dataset into memory with `get()` defeats the purpose of storing a large-scale dataset in an HDF5 file. In practice, it is recommended to load only minibatches (of, say, 1000 samples) one at a time, process them, and ask for a new one.

The method `generate_batches()` returns a generator that loops over a dataset and yields the data `(X, y)` for each minibatch:

~~~python
for X, y in db.generate_batches(batch_size=100, dset='train', rest=True):
    do_something(X, y)
~~~

`batch_size` defines how many samples will be in each minibatch, `dset` from which dataset the samples will be taken (`['all', 'train', 'val', 'test']`) and `rest` what should be done with the last samples if the total number of samples is not a multiple of the batch size. For example, if the dataset has 1537 samples and the batch size is 100, the `for` loop will be executed 15 times. The remaining 37 samples will be returned only if `rest` is set to True (as smaller batches may cause problems with some tensor libraries).

Between two calls to `generate_batches()`, the indices are shuffled, so the minibatches will never be identical between epochs.

The example in `examples/TrainKeras-Generator.py` shows how to use minibatches with Keras. Note that the `fit_generator()` method of Keras does not work with this generator: Keras runs the generator in a separate thread, and h5py does not cope well with being accessed from several threads at once.

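#### Accessing the video indices

The HDF5 file also stores, for each sample, the index of the video it was extracted from, exposed as the `video` attribute of `YouTubeFacesDB` (an HDF5 dataset with one integer per sample). This index is relative to each person (every person's first clip has index 0), so combine it with the label to identify a unique clip. A minimal sketch, assuming the default `vector` outputs, for selecting all samples coming from the same clip as sample 0:

~~~python
import numpy as np
video = np.array(db.video)       # per-person video index for each sample
X, y = db.get()
label_idx = np.argmax(y, axis=1) # class index per sample (vector outputs)
same_clip = np.flatnonzero((video == video[0]) & (label_idx == label_idx[0]))
~~~
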
--------------------------------------------------------------------------------
/YouTubeFacesDB/Dataset.py:
--------------------------------------------------------------------------------
# Standard library
from __future__ import print_function, with_statement
from time import time
import re
import os
import copy
import random
import csv
# Dependencies
import numpy as np
import h5py
from PIL import Image


def to_categorical(y, nb_classes=None):
    """
    Convert class vector (integers from 0 to nb_classes - 1) to binary class matrix, for use with categorical_crossentropy.

    Taken from Keras.
    """
    if nb_classes is None: # infer the number of classes if not provided
        nb_classes = int(np.max(y)) + 1
    Y = np.zeros((len(y), nb_classes))
    for i in range(len(y)):
        Y[i, y[i]] = 1.
    return Y

class YouTubeFacesDB(object):
    """
    Class allowing to interact with a HDF5 file containing a subset of the YouTube Faces dataset.
    """
    def __init__(self, filename, mean_removal=False, output_type='vector'):
        """
        Parameters:

        * `filename`: path to the HDF5 file containing the data.
        * `mean_removal`: defines if the mean image should be subtracted from each image.
        * `output_type`: ['integer', 'vector'] defines the output for each sample. 'integer' will return the index of the class (e.g. 3), while 'vector' will return a vector with nb_classes components, all zero but one (e.g. 000...00100). Default: 'vector'.
        """
        # Open the file
        self.filename = filename
        try:
            self.f = h5py.File(self.filename, "r")
        except Exception:
            print('Error:', self.filename, 'does not exist.')
            raise

        # Data
        self._X = self.f.get('X')
        self._y = self.f.get('Y')

        # Mean input
        self.mean_removal = mean_removal
        self.mean = np.array(self.f.get('mean'))

        # Size
        shape = self._X.shape
        #: Total number of samples in the dataset
        self.nb_samples = shape[0]
        #: Shape of the inputs
        self.input_dim = shape[1:]

        # Indices
        self._indices = list(range(self.nb_samples))
        self._training_indices = self._indices
        self._validation_indices = []
        self._test_indices = []
        #: Number of samples in the training set
        self.nb_train = self.nb_samples
        #: Number of samples in the validation set
        self.nb_val = 0
        #: Number of samples in the test set
        self.nb_test = 0

        # Labels
        labels = self.f.get('labels')
        #: List of labels
        self.labels = []
        for label in labels:
            self.labels.append(str(label[0]))
        #: Total number of classes
        self.nb_classes = len(self.labels)
        if output_type not in ['integer', 'vector']:
            print("Error: output_type must be in ['integer', 'vector']")
            output_type = 'vector'
        #: Output type ['integer', 'vector']
        self.output_type = output_type

        #: Index of the video for each frame
        self.video = self.f.get('video')

    def split_dataset(self, validation_size=0.2, test_size=0.0):
        """
        Split the dataset into a training set, a validation set and optionally a test set.

        Parameters:

        * `validation_size`: proportion of the data in the validation set (default: 0.2)
        * `test_size`: proportion of the data in the test set (default: 0.0)

        The split is only internal to the object (the method returns nothing), as the actual data should be later read from disk.

        This method sets the following attributes:

        * `self.nb_train`: number of samples in the training set.
        * `self.nb_val`: number of samples in the validation set.
        * `self.nb_test`: number of samples in the test set.

        To actually get the data, you will have to call either::

            X, y = db.get('all')
            X_train, y_train = db.get('train')
            X_val, y_val = db.get('val')
            X_test, y_test = db.get('test')
        """
        # Number of examples
        self.nb_val = int(self.nb_samples*validation_size)
        self.nb_test = int(self.nb_samples*test_size)
        self.nb_train = self.nb_samples - self.nb_val - self.nb_test
        # Compute the indices
        indices = copy.deepcopy(self._indices)
        random.shuffle(indices)
        self._validation_indices = sorted(indices[:self.nb_val])
        if self.nb_test != 0:
            self._test_indices = sorted(indices[self.nb_val:self.nb_val+self.nb_test])
        else:
            self._test_indices = []
        self._training_indices = sorted(indices[self.nb_val+self.nb_test:])
        print('Training:', self.nb_train, '; Validation:', self.nb_val, '; Test:', self.nb_test, '; Total:', self.nb_samples)

    def get(self, dset='all'):
        """
        Returns the requested part of the dataset as a tuple (X, y) of numpy arrays.

        Parameters:

        * `dset`: string in ['train', 'val', 'test', 'all'] for the desired part of the dataset (default: 'all').
        """
        if dset == 'all':
            X = np.array(self._X)
            y = np.array(self._y, dtype='int32')
        elif dset == 'train':
            X = np.array(self._X[self._training_indices, ...])
            y = np.array(self._y[self._training_indices, ...], dtype='int32')
        elif dset == 'val':
            X = np.array(self._X[self._validation_indices, ...])
            y = np.array(self._y[self._validation_indices, ...], dtype='int32')
        elif dset == 'test':
            X = np.array(self._X[self._test_indices, ...])
            y = np.array(self._y[self._test_indices, ...], dtype='int32')
        else:
            print("Error: the `dset` argument to get() must be in ['train', 'val', 'test', 'all']")
            X = np.array([[]])
            y = np.array([], dtype='int32')

        return self._transform_data(X, y)

    def _transform_data(self, X, y):
        "Applies transformations to the data (mean removal, output type...)."
        # Mean removal
        if self.mean_removal:
            X -= self.mean

        # Categorical outputs
        if self.output_type == 'vector':
            y = to_categorical(y, self.nb_classes)

        return X, y

    def generate_batches(self, batch_size, dset='all', rest=True):
        """
        Generator yielding minibatches of random samples of the DB as (X, y) tuples, until the dataset has been fully seen.

        Parameters:

        * `batch_size`: number of samples per minibatch.
        * `dset`: string in ['train', 'val', 'test', 'all'] for the desired part of the dataset (default: 'all').
        * `rest`: defines if the remaining samples after the last full minibatch should be sent anyway (default: True)
        """
        # Access the dataset indices
        if dset == 'train':
            indices = copy.deepcopy(self._training_indices)
            N = self.nb_train
        elif dset == 'val':
            indices = copy.deepcopy(self._validation_indices)
            N = self.nb_val
        elif dset == 'test':
            indices = copy.deepcopy(self._test_indices)
            N = self.nb_test
        elif dset == 'all':
            indices = copy.deepcopy(self._indices)
            N = self.nb_samples
        else:
            print("Error: the `dset` argument to generate_batches() must be in ['train', 'val', 'test', 'all']")
            return

        # Compute the number of minibatches
        nb_batches = int(N/batch_size)
        rest_batches = N - nb_batches*batch_size # number of samples left over

        # Shuffle the training set
        random.shuffle(indices)

        # Iterate over the minibatches
        for b in range(nb_batches):
            samples = sorted(indices[b*batch_size:(b+1)*batch_size])
            X = np.array(self._X[samples, ...])
            y = np.array(self._y[samples, ...], dtype='int32')
            X, y = self._transform_data(X, y)
            yield X, y

        # Yield the remaining samples, if any. May be inefficient.
        if rest_batches != 0 and rest:
            samples = sorted(indices[nb_batches*batch_size:])
            X = np.array(self._X[samples, ...])
            y = np.array(self._y[samples, ...], dtype='int32')
            X, y = self._transform_data(X, y)
            yield X, y

--------------------------------------------------------------------------------
/YouTubeFacesDB/Generator.py:
--------------------------------------------------------------------------------
# Standard library
from __future__ import print_function, with_statement
from time import time
import re
import os
import random
import csv
# Dependencies
import numpy as np
import h5py
from PIL import Image

# Structure of the YTF directory
original_folder = '/frame_images_DB/'
aligned_folder = '/aligned_images_DB/'

def _get_labels(directory):
    "Retrieves the list of labels from the aligned directory"
    return sorted(os.listdir(directory + aligned_folder), key=lambda s: s.lower())

def _check_labels(labels, directory):
    "Compares the provided list of labels to ones which exist."
    orig = _get_labels(directory)
    for label in labels:
        if not label in orig:
            print('Error:', label, 'does not exist in the YouTube Faces database.')
            exit(1)


def _gather_images_info(directory, labels, max_images_per_person):
    "Iterates over all labels and gets the filenames and crop information"
    data = []
    for name in labels:
        # Each image is described in frame_images_DB/Aaron_Eckhart.labeled_faces.txt
        data_file = directory + original_folder + name + '.labeled_faces.txt'
        # Read the file
        data_person = []
        try:
            with open(data_file, 'r') as csvfile:
                for entry in csv.reader(csvfile, delimiter=','):
                    img_name = entry[0].replace('\\', '/')
                    center_w, center_h = int(entry[2]), int(entry[3])
                    size_w, size_h = int(entry[4]), int(entry[5])
                    data_person.append({
                        'name': name,
                        'filename': img_name,
                        'center': (center_w, center_h),
                        'size': (size_w, size_h)
                    })
        except Exception as e:
            print('Error: could not read', data_file)
            print(e)
            return data # abort and return what has been gathered so far
        # Possibly select a maximal number of them
        if max_images_per_person == -1: # everything
            data.extend(data_person)
        else:
            data.extend(random.sample(data_person, max_images_per_person))

    return data

def _create_db(directory, metadata, labels, filename, size, color, rgb_first, bw_first, cropped):
    "Main method to fetch all images into the hdf5 DB."
    # Total number of images
    nb_images = len(metadata)
    # Final size of the image
    if color and rgb_first:
        final_size = (3, ) # channel is first
    elif not color and bw_first: # add a dummy (1,) in front
        final_size = (1, )
    else:
        final_size = ()
    final_size += size
    if color and not rgb_first:
        final_size += (3,)
    print('Final size of the images:', final_size)
    # Initialize the hdf5 DB
    f = h5py.File(filename, "w")
    dset_X = f.create_dataset("X", (nb_images,) + final_size, dtype='f')
    dset_Y = f.create_dataset("Y", (nb_images,), dtype='i')
    dset_video = f.create_dataset("video", (nb_images,), dtype='i')
    # Save the list of labels as fixed-length ASCII strings
    max_length = 0
    for label in labels:
        max_length = max(max_length, len(label))
    asciiList = [n.encode("ascii", "ignore") for n in labels]
    f.create_dataset('labels', (len(labels), 1), 'S'+str(max_length), asciiList)
    # Compute the mean image
    mean_img = np.zeros(final_size)
    # Iterate over all images
    for idx in range(nb_images):
        # Retrieve the info
        description = metadata[idx] # description
        name = description['name'] # name of the person
        y = labels.index(name) # corresponding index between 0 and 1594
        img_filename = description['filename'] # complete filename of the image
        video_idx = int(re.findall(r'/([\d]+)/', img_filename)[0]) # index of the video
        center_w, center_h = description['center'] # center of the face
        size_w, size_h = description['size'] # size of the face
        # Get the image
        img_file_path = directory + original_folder + img_filename
        img = Image.open(img_file_path)
        # Crop the image to the face
        if cropped:
            img = img.crop((center_w - size_w/2, center_h - size_h/2, center_w + size_w/2, center_h + size_h/2))
        # Resize the image
        img = img.resize(size)
        # Color
        if not color:
            img = img.convert('L')
        # Get the numpy array
        img_data = np.array(img).astype('float32')/255.
        # Swap the axes (to have (3, w, h))
        if color and rgb_first:
            img_data = img_data.swapaxes(0, 2)
        # Add a dummy first axis to BW images for theano
        if not color and bw_first:
            img_data = img_data[np.newaxis, :, :]
        # Update the mean incrementally
        mean_img += (img_data - mean_img)/float(idx+1)
        # Push it to the HDF5 file
        dset_X[idx, ...] = img_data
        dset_Y[idx] = y
        dset_video[idx] = video_idx
    # Last, save the mean and close the file
    f.create_dataset('mean', (1, )+final_size, 'f', mean_img)
    f.close()

def generate_ytf_database(
        directory,
        filename,
        size,
        labels=None,
        max_number=-1,
        max_images_per_person=-1,
        color=True,
        rgb_first=True,
        bw_first=False,
        cropped=True):
    """
    Method to generate a subset of the YouTube Faces database in a HDF5 file.

    Arguments:

    * `directory`: directory where the YouTube Faces DB is located.
    * `filename`: path and name of the hdf5 file where the DB will be saved.
    * `size`: (width, height) size for the extracted images.
    * `labels`: number or list of labels which should be used (default: None, for all labels).
    * `max_number`: maximum number of images (default: -1, all of them).
    * `max_images_per_person`: maximum number of images which should be extracted per person (default: -1, all images)
    * `color`: if the color channels should be preserved (default: True)
    * `rgb_first`: if True, the numpy arrays of colored images will have the shape (3, w, h), otherwise (w, h, 3) (default: True). Useful for Theano backends.
    * `bw_first`: if True, the numpy arrays of black&white images will have the shape (1, w, h), otherwise (w, h) (default: False). Useful for Theano backends.
    * `cropped`: if the images should be cropped around the detected face (default: True)
    """
    tstart = time()
    # Get the labels
    if labels is None or labels == -1:
        print('Retrieving all labels...')
        labels = _get_labels(directory)
    elif isinstance(labels, int):
        print('Generating', labels, 'labels randomly...')
        nb_labels = labels
        orig = _get_labels(directory)
        if nb_labels >= len(orig):
            print('There are only', len(orig), 'labels in the database...')
            labels = orig
        else:
            labels = sorted(random.sample(orig, nb_labels), key=lambda s: s.lower())
        for label in labels:
            print('\t', label)
    else:
        print('Checking the labels...')
        _check_labels(labels, directory)

    # Retrieve the metadata on all images
    print('Gathering image locations...')
    metadata = _gather_images_info(directory, labels, max_images_per_person)
    nb_images = len(metadata)
    print('Found', nb_images, 'images for', len(labels), 'people.')

    # Reduce the number of images
    if max_number != -1 and max_number < nb_images:
        print('Reducing this number to', max_number)
        metadata = random.sample(metadata, max_number)

    # Get all the images, crop/resize them, and save them into a hdf5 file
    _create_db(directory, metadata, labels, filename, size, color, rgb_first, bw_first, cropped)
    print('Done in', time()-tstart, 'seconds.')

--------------------------------------------------------------------------------
/YouTubeFacesDB/__init__.py:
--------------------------------------------------------------------------------
from Generator import generate_ytf_database
from Dataset import YouTubeFacesDB

--------------------------------------------------------------------------------
/docs/Makefile:
--------------------------------------------------------------------------------
# Makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS    =
SPHINXBUILD   = sphinx-build
PAPER         =
BUILDDIR      = build

# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/)
endif

# Internal variables.
PAPEROPT_a4     = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS   = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS  = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source

.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext

help:
	@echo "Please use \`make <target>' where <target> is one of"
	@echo "  html       to make standalone HTML files"
	@echo "  dirhtml    to make HTML files named index.html in directories"
	@echo "  singlehtml to make a single large HTML file"
	@echo "  pickle     to make pickle files"
	@echo "  json       to make JSON files"
	@echo "  htmlhelp   to make HTML files and a HTML help project"
	@echo "  qthelp     to make HTML files and a qthelp project"
	@echo "  devhelp    to make HTML files and a Devhelp project"
	@echo "  epub       to make an epub"
	@echo "  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
	@echo "  latexpdf   to make LaTeX files and run them through pdflatex"
	@echo "  latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
	@echo "  text       to make text files"
	@echo "  man        to make manual pages"
	@echo "  texinfo    to make Texinfo files"
	@echo "  info       to make Texinfo files and run them through makeinfo"
	@echo "  gettext    to make PO message catalogs"
	@echo "  changes    to make an overview of all changed/added/deprecated items"
	@echo "  xml        to make Docutils-native XML files"
	@echo "  pseudoxml  to make pseudoxml-XML files for display purposes"
	@echo "  linkcheck  to check all external links for integrity"
	@echo "  doctest    to run all doctests embedded in the documentation (if enabled)"

clean:
	rm -rf $(BUILDDIR)/*

readme:
	pandoc ../README.md -o source/tutorial.rst

html:
	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."

dirhtml:
	$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."

singlehtml:
	$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
	@echo
	@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."

pickle:
	$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
	@echo
	@echo "Build finished; now you can process the pickle files."

json:
	$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
	@echo
	@echo "Build finished; now you can process the JSON files."

htmlhelp:
	$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
	@echo
	@echo "Build finished; now you can run HTML Help Workshop with the" \
	      ".hhp project file in $(BUILDDIR)/htmlhelp."

qthelp:
	$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
	@echo
	@echo "Build finished; now you can run "qcollectiongenerator" with the" \
	      ".qhcp project file in $(BUILDDIR)/qthelp, like this:"
	@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/YouTubeFacesDB.qhcp"
	@echo "To view the help file:"
	@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/YouTubeFacesDB.qhc"

devhelp:
	$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
	@echo
	@echo "Build finished."
	@echo "To view the help file:"
	@echo "# mkdir -p $$HOME/.local/share/devhelp/YouTubeFacesDB"
	@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/YouTubeFacesDB"
	@echo "# devhelp"

epub:
	$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
	@echo
	@echo "Build finished. The epub file is in $(BUILDDIR)/epub."

latex:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo
	@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
	@echo "Run \`make' in that directory to run these through (pdf)latex" \
	      "(use \`make latexpdf' here to do that automatically)."

latexpdf:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo "Running LaTeX files through pdflatex..."
	$(MAKE) -C $(BUILDDIR)/latex all-pdf
	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."

latexpdfja:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo "Running LaTeX files through platex and dvipdfmx..."
	$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."

text:
	$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
	@echo
	@echo "Build finished. The text files are in $(BUILDDIR)/text."

man:
	$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
	@echo
	@echo "Build finished. The manual pages are in $(BUILDDIR)/man."

texinfo:
	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
	@echo
	@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
	@echo "Run \`make' in that directory to run these through makeinfo" \
	      "(use \`make info' here to do that automatically)."

info:
	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
	@echo "Running Texinfo files through makeinfo..."
	make -C $(BUILDDIR)/texinfo info
	@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."

gettext:
	$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
	@echo
	@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."

changes:
	$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
	@echo
	@echo "The overview file is in $(BUILDDIR)/changes."

linkcheck:
	$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
	@echo
	@echo "Link check complete; look for any errors in the above output " \
	      "or in $(BUILDDIR)/linkcheck/output.txt."

doctest:
	$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
	@echo "Testing of doctests in the sources finished, look at the " \
	      "results in $(BUILDDIR)/doctest/output.txt."

xml:
	$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
	@echo
	@echo "Build finished. The XML files are in $(BUILDDIR)/xml."

pseudoxml:
	$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
	@echo
	@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."

--------------------------------------------------------------------------------
/docs/source/api.rst:
--------------------------------------------------------------------------------
Documentation
=============

Method ``generate_ytf_database``
--------------------------------

.. autofunction:: YouTubeFacesDB.generate_ytf_database

Class ``YouTubeFacesDB``
------------------------

.. autoclass:: YouTubeFacesDB.YouTubeFacesDB
    :members:

--------------------------------------------------------------------------------
/docs/source/conf.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
#
# YouTubeFacesDB documentation build configuration file, created by
# sphinx-quickstart on Wed Feb 10 18:14:23 2016.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.

import sys
import os

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath('..'))

# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.mathjax',
    'sphinx.ext.viewcode',
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# The suffix of source filenames.
source_suffix = '.rst'

# The encoding of source files.
#source_encoding = 'utf-8-sig'

# The master toctree document.
master_doc = 'index'

# General information about the project.
project = u'YouTubeFacesDB'
copyright = u'2016, Julien Vitay'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '0.0.1'
# The full version, including alpha/beta/rc tags.
release = '0.0.1'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#language = None

# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = []

# The reST default role (used for this markup: `text`) to use for all
# documents.
#default_role = None

# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True

# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True

# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'

# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []

# If true, keep warnings as "system message" paragraphs in the built documents.
#keep_warnings = False


# -- Options for HTML output ----------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = 'haiku'

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}

# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []

# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
#html_title = None

# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None

# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
#html_logo = None

# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
#html_extra_path = []

# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
#html_last_updated_fmt = '%b %d, %Y'

# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
#html_use_smartypants = True

# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}

# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}

# If false, no module index is generated.
#html_domain_indices = True

# If false, no index is generated.
#html_use_index = True

# If true, the index is split into individual pages for each letter.
#html_split_index = False

# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True

# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
#html_show_sphinx = True

# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
#html_show_copyright = True

# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''

# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None

# Output file base name for HTML help builder.
htmlhelp_basename = 'YouTubeFacesDBdoc'


# -- Options for LaTeX output ---------------------------------------------

latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#'papersize': 'letterpaper',

# The font size ('10pt', '11pt' or '12pt').
#'pointsize': '10pt',

# Additional stuff for the LaTeX preamble.
#'preamble': '',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
#  author, documentclass [howto, manual, or own class]).
latex_documents = [
  ('index', 'YouTubeFacesDB.tex', u'YouTubeFacesDB Documentation',
   u'Julien Vitay', 'manual'),
]

# The name of an image file (relative to this directory) to place at the top of
# the title page.
#latex_logo = None

# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False

# If true, show page references after internal links.
#latex_show_pagerefs = False

# If true, show URL addresses after external links.
#latex_show_urls = False

# Documents to append as an appendix to all manuals.
#latex_appendices = []

# If false, no module index is generated.
#latex_domain_indices = True


# -- Options for manual page output ---------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    ('index', 'youtubefacesdb', u'YouTubeFacesDB Documentation',
     [u'Julien Vitay'], 1)
]

# If true, show URL addresses after external links.
#man_show_urls = False


# -- Options for Texinfo output -------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
#  dir menu entry, description, category)
texinfo_documents = [
  ('index', 'YouTubeFacesDB', u'YouTubeFacesDB Documentation',
   u'Julien Vitay', 'YouTubeFacesDB', 'Python scripts to load the YouTube Faces Database.',
   'Miscellaneous'),
]

# Documents to append as an appendix to all manuals.
#texinfo_appendices = []

# If false, no module index is generated.
#texinfo_domain_indices = True

# How to display URL addresses: 'footnote', 'no', or 'inline'.
#texinfo_show_urls = 'footnote'

# If true, do not generate a @detailmenu in the "Top" node's menu.
#texinfo_no_detailmenu = False

#########################################
# autodoc parameters
#########################################

autodoc_member_order = 'groupwise'
autoclass_content = 'both'

--------------------------------------------------------------------------------
/docs/source/index.rst:
--------------------------------------------------------------------------------
YouTubeFacesDB
==============

Python module for loading the YouTube Faces Database:

http://www.cs.tau.ac.il/~wolf/ytfaces/

**Description:** The data set contains 3,425 videos of 1,595 different
people. All the videos were downloaded from YouTube. An average of 2.15
videos are available for each subject. The shortest clip duration is 48
frames, the longest clip is 6,070 frames, and the average length of a
video clip is 181.3 frames.

**For TUC users:** the DB is already downloaded on cortex at
``/work/biblio/youtube Faces DB`` (with the spaces). Copy it to your
machine (in ``/scratch``, as it is over 25GB) and uncompress it.

**Author:** Julien Vitay julien.vitay@informatik.tu-chemnitz.de

**License:** MIT

.. toctree::
    :maxdepth: 4

    tutorial
    api

--------------------------------------------------------------------------------
/docs/source/tutorial.rst:
--------------------------------------------------------------------------------
YouTubeFacesDB
==============

Python module for loading the YouTube Faces Database:

http://www.cs.tau.ac.il/~wolf/ytfaces/

**Description:** The data set contains 3,425 videos of 1,595 different
people. All the videos were downloaded from YouTube. An average of 2.15
videos are available for each subject. The shortest clip duration is 48
frames, the longest clip is 6,070 frames, and the average length of a
video clip is 181.3 frames.

**For TUC users:** the DB is already downloaded on cortex at
``/work/biblio/youtube Faces DB`` (with the spaces). Copy it to your
machine (in ``/scratch``, as it is over 25GB) and uncompress it.

**Author:** Julien Vitay julien.vitay@informatik.tu-chemnitz.de

**License:** MIT

Installation
------------

Apart from the usual Python (2.7) + numpy dependencies, the module
requires:

- **Pillow** ``pip install Pillow --user`` for image processing.
- **h5py** ``pip install h5py --user`` to manage the HDF5 files.
  ``libhdf5`` should also be installed through your package manager.

The module can then be installed locally with:

.. code:: bash

    python setup.py install --user

To build the documentation, you will need Sphinx
``pip install Sphinx --user``. You can then go into the ``docs/``
directory and build it with:

.. code:: bash

    make html

You can then access ``docs/build/html/index.html`` with your browser.

Tutorial
--------

Transforming the YouTube Faces Database into an HDF5 file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

An example is provided in ``examples/GenerateSubset.py``. It accesses
the dataset located at ``/scratch/vitay/Datasets/YouTubeFaces``
(``directory``), selects 10 random labels from it (``labels``), fetches
all corresponding images (``max_number``), crops them so that they
contain only the face area (``cropped``), converts them to grayscale
(``color``), resizes them to (100, 100) (``size``), prepends a dummy
dimension to obtain a final numpy array of shape (1, 100, 100)
(``bw_first``) and dumps them to the HDF5 file ``ytfdb.h5``
(``filename``).

.. code:: python

    from YouTubeFacesDB import generate_ytf_database
    generate_ytf_database(
        directory='/scratch/vitay/Datasets/YouTubeFaces', # Location of the YTF dataset
        filename='ytfdb.h5', # Name of the HDF5 file to write to
        labels=10, # Number of labels to randomly select
        max_number=-1, # Maximum number of images to use
        size=(100, 100), # Size of the images
        color=False, # Black and white
        bw_first=True, # Final shape is (1, w, h)
        cropped=True # The original images are cropped to the faces
    )

Check the doc of ``generate_ytf_database`` to see the other arguments
to this function.

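For instance, here is a sketch that keeps the color channels and limits
the number of images per person, using the documented
``max_images_per_person`` and ``rgb_first`` arguments (the label list
and the output file name below are only placeholders):

.. code:: python

    from YouTubeFacesDB import generate_ytf_database
    generate_ytf_database(
        directory='/scratch/vitay/Datasets/YouTubeFaces',
        filename='ytfdb-color.h5',     # hypothetical output file
        labels=['Aaron_Eckhart'],      # an explicit list of labels instead of a count
        max_images_per_person=500,     # at most 500 images per person
        size=(100, 100),               # Size of the images
        color=True,                    # keep the RGB channels
        rgb_first=True,                # final shape is (3, w, h), e.g. for Theano
        cropped=True                   # crop to the face region
    )
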
**Beware:** if you try to generate all color images of all labels with a
size (100, 100), the process will take over half an hour and the HDF5
file will be over 50GB, so do not save it in your home directory.

Loading the HDF5 file for usage in Python
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once the HDF5 file has been generated, you can use it in Python for
learning. An example is provided in ``examples/TrainKeras.py``, where a
convolutional network written in Keras
(``pip install Theano --user && pip install keras --user``) is trained
on the data contained in ``ytfdb.h5``.

Loading the dataset into memory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To load the data, you need to create a ``YouTubeFacesDB`` object, pass
it the path to the HDF5 file and call the ``get()`` method:

.. code:: python

    from YouTubeFacesDB import YouTubeFacesDB
    db = YouTubeFacesDB('ytfdb.h5')
    X, y = db.get()

``X`` is a numpy array containing all input images. The first index
corresponds to the image number, the remaining ones to the shape of the
numpy array representing each image. This information can also be
retrieved through the attributes of the object:

.. code:: python

    N = db.nb_samples # number of samples, e.g. 10000
    d = db.input_dim # shape of the images, e.g. (1, 100, 100)

``y`` is a numpy array containing the label index for each image (in
vectorized form, see *categorical outputs*). You can access the number
of labels, as well as the list of labels, easily:

.. code:: python

    C = db.nb_classes # Number of classes
    labels = db.labels # List of strings for the labels

Transforming the data
^^^^^^^^^^^^^^^^^^^^^

**Mean removal**

``X`` contains for each pixel a floating-point value between 0. and 1.
(the conversion from integers [0..255] to floats [0..1] was done during
the generation process). However, neural networks typically work much
better when the input data has a zero mean. Fortunately, the mean input
(i.e. the mean face) was also saved during the generation process. You
can remove it from the input using:

.. code:: python

    mean_face = db.mean
    X -= mean_face

You can also tell the ``YouTubeFacesDB`` object to systematically
remove this mean from the inputs:

.. code:: python

    db = YouTubeFacesDB('ytfdb.h5', mean_removal=True)
    X, y = db.get()

This way, ``X`` has a zero mean over the first axis, without needing to
explicitly compute it. This is particularly useful when generating
minibatches.

**Categorical outputs**

The output labels are originally integers between 0 and
``db.nb_classes`` - 1. To train neural networks, it is often required
to represent the output as binary arrays of length ``db.nb_classes``,
where only one element is 1 and the rest are 0. For example, the third
class among 10 would be represented by ``0010000000``. This is the
default representation returned by the ``YouTubeFacesDB`` object.

If you prefer to get the labels as integers in ``y``, you can specify
it in the constructor:

.. code:: python

    db = YouTubeFacesDB('ytfdb.h5', output_type='integer')

The default value of ``output_type`` is ``vector``.

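Since the default outputs are one-hot encoded, mapping a sample (or a
network prediction) back to the person's name is a simple ``argmax``
over the vector, followed by a lookup in ``db.labels``. A minimal
sketch, reusing the ``y`` array obtained from ``db.get()`` above:

.. code:: python

    import numpy as np
    class_index = np.argmax(y[0]) # index of the active class for the first sample
    name = db.labels[class_index] # corresponding label string
    print(name)
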
Splitting the data into training, validation and test sets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``db.get()`` returns by default the whole data. If you want to split
this data into training, validation and test sets, you can call the
method ``split_dataset()``:

.. code:: python

    db.split_dataset(validation_size=0.2, test_size=0.1)

In this example, the validation set will contain 20% of the samples and
the test set 10%. The rest stays in the training set. The samples are
randomly chosen from the data. To retrieve the corresponding data,
provide an argument to ``get()``:

.. code:: python

    db.split_dataset(validation_size=0.2, test_size=0.1)
    X_train, y_train = db.get('train')
    X_val, y_val = db.get('val')
    X_test, y_test = db.get('test')

By default, the validation set has 20% of the data and the test set 0%.

Generating minibatches
^^^^^^^^^^^^^^^^^^^^^^

Loading the whole dataset into memory with ``get()`` defeats the
purpose of storing a large-scale dataset in an HDF5 file. In practice,
it is recommended to load only minibatches (of, say, 1000 samples) one
at a time, process them, and ask for a new one.

The method ``generate_batches()`` returns a generator that loops over a
dataset and yields the data ``(X, y)`` for each minibatch:

.. code:: python

    for X, y in db.generate_batches(batch_size=100, dset='train', rest=True):
        do_something(X, y)

``batch_size`` defines how many samples will be in each minibatch,
``dset`` from which dataset the samples will be taken
(``['all', 'train', 'val', 'test']``) and ``rest`` what should be done
with the last samples if the total number of samples is not a multiple
of the batch size. For example, if the dataset has 1537 samples and the
batch size is 100, the ``for`` loop will be executed 15 times. The
remaining 37 samples will be returned only if ``rest`` is set to True
(as smaller batches may cause problems with some tensor libraries).

Between two calls to ``generate_batches()``, the indices are shuffled,
so the minibatches will never be identical between epochs.

The example in ``examples/TrainKeras-Generator.py`` shows how to use
minibatches with Keras. Note that the ``fit_generator()`` method of
Keras does not work with this generator: Keras runs the generator in a
separate thread, and h5py does not cope well with being accessed from
several threads at once.

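Accessing the video indices
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The HDF5 file also stores, for each sample, the index of the video it
was extracted from, exposed as the ``video`` attribute of
``YouTubeFacesDB`` (an HDF5 dataset with one integer per sample). This
index is relative to each person (every person's first clip has index
0), so combine it with the label to identify a unique clip. A minimal
sketch, assuming the default ``vector`` outputs, for selecting all
samples coming from the same clip as sample 0:

.. code:: python

    import numpy as np
    video = np.array(db.video)       # per-person video index for each sample
    X, y = db.get()
    label_idx = np.argmax(y, axis=1) # class index per sample (vector outputs)
    same_clip = np.flatnonzero((video == video[0]) & (label_idx == label_idx[0]))
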
model.add(Dense(C)) 52 | model.add(Activation('softmax')) 53 | 54 | # Learning rule 55 | optimizer = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True) 56 | model.compile(loss='categorical_crossentropy', optimizer=optimizer) 57 | 58 | # Training 59 | print('Start training...') 60 | nb_epochs = 10 61 | batch_size = 100 62 | try: 63 | for epoch in range(nb_epochs): 64 | print('Epoch', epoch+1, '/', nb_epochs) 65 | tstart = time() 66 | # Training 67 | batch_generator = db.generate_batches(batch_size, dset='train') 68 | nb_train_batches = 0; train_loss = 0.0; train_accuracy = 0.0 69 | for X, y in batch_generator: 70 | loss, accuracy = model.train_on_batch(X, y, accuracy=True) 71 | train_loss += loss 72 | train_accuracy += accuracy 73 | nb_train_batches += 1 74 | # Validation 75 | batch_generator = db.generate_batches(batch_size, dset='val') 76 | nb_val_batches = 0; val_loss = 0.0; val_accuracy = 0.0 77 | for X, y in batch_generator: 78 | loss, accuracy = model.test_on_batch(X, y, accuracy=True) 79 | val_loss += loss 80 | val_accuracy += accuracy 81 | nb_val_batches += 1 82 | # Verbose 83 | print('\tTraining loss:', train_loss/float(nb_train_batches), 'accuracy:', train_accuracy/float(nb_train_batches)) 84 | print('\tValidation loss:', val_loss/float(nb_val_batches), 'accuracy:', val_accuracy/float(nb_val_batches)) 85 | print('\tTook', time()-tstart) 86 | 87 | except KeyboardInterrupt: 88 | pass 89 | 90 | # Final evaluation on the validation set (no test split is created in this script) 91 | print('Training finished.') 92 | batch_generator = db.generate_batches(batch_size, dset='val') 93 | nb_val_batches = 0; val_loss = 0.0; val_accuracy = 0.0 94 | for X, y in batch_generator: 95 | loss, accuracy = model.test_on_batch(X, y, accuracy=True) 96 | val_loss += loss 97 | val_accuracy += accuracy 98 | nb_val_batches += 1 99 | print('Validation loss:', val_loss/float(nb_val_batches), 'accuracy:', val_accuracy/float(nb_val_batches)) -------------------------------------------------------------------------------- /examples/TrainKeras.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, with_statement 2 | from time import time 3 | 4 | from YouTubeFacesDB import YouTubeFacesDB 5 | 6 | from keras.models import Sequential 7 | from keras.layers.core import Dense, Dropout, Activation, Flatten 8 | from keras.layers.convolutional import Convolution2D, MaxPooling2D 9 | from keras.optimizers import SGD, Adam, RMSprop, Adagrad, Adadelta 10 | from keras.regularizers import l2, activity_l2 11 | from keras.utils import np_utils 12 | 13 | 14 | ############################################################################### 15 | # Load the data from disk 16 | ############################################################################### 17 | tstart = time() 18 | 19 | db = YouTubeFacesDB('ytfdb.h5', mean_removal=True, output_type='vector') 20 | N = db.nb_samples 21 | d = db.input_dim 22 | C = db.nb_classes 23 | 24 | print(N, 'images of size', d, 'loaded in', time()-tstart) 25 | 26 | ############################################################################### 27 | # Split into a training set and a test set 28 | ############################################################################### 29 | db.split_dataset(validation_size=0.25) 30 | X_train, y_train = db.get('train') 31 | X_test, y_test = db.get('val') 32 | 33 | ############################################################################### 34 | # Train a not very deep network
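# (Same architecture as TrainKeras-Generator.py, but trained with model.fit() on arrays held fully in memory.)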
35 | ############################################################################### 36 | print('Create the network...') 37 | model = Sequential() 38 | 39 | # Convolutional input layer with maxpooling and dropout 40 | model.add(Convolution2D(16, 6, 6, border_mode='valid', input_shape=d)) 41 | model.add(Activation('relu')) 42 | model.add(MaxPooling2D(pool_size=(2, 2))) 43 | model.add(Dropout(0.5)) 44 | 45 | # Fully connected with ReLU and dropout 46 | model.add(Flatten()) 47 | model.add(Dense(100)) 48 | model.add(Activation('relu')) 49 | model.add(Dropout(0.5)) 50 | 51 | # Softmax output layer 52 | model.add(Dense(C)) 53 | model.add(Activation('softmax')) 54 | 55 | # Learning rule 56 | optimizer = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True) 57 | model.compile(loss='categorical_crossentropy', optimizer=optimizer) 58 | 59 | # Training 60 | print('Start training...') 61 | try: 62 | model.fit(X_train, y_train, 63 | batch_size=100, nb_epoch=10, 64 | show_accuracy=True, verbose=2, 65 | validation_data=(X_test, y_test)) 66 | except KeyboardInterrupt: 67 | pass 68 | 69 | # Evaluate on the validation set 70 | score = model.evaluate(X_test, y_test, 71 | show_accuracy=True, verbose=2) 72 | 73 | print('Training finished.') 74 | print('Validation score:', score[0]) 75 | print('Validation accuracy:', score[1]) -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import os 2 | from setuptools import setup 3 | 4 | # Utility function to read the README file. 5 | # Used for the long_description. It's nice, because now 1) we have a top level 6 | # README file and 2) it's easier to type in the README file than to put a raw 7 | # string in below ... 8 | def read(fname): 9 | return open(os.path.join(os.path.dirname(__file__), fname)).read() 10 | 11 | setup( 12 | name = "YouTubeFacesDB", 13 | version = "0.0.1", 14 | author = "Julien Vitay", 15 | author_email = "julien.vitay@gmail.com", 16 | description = ("Python scripts to load the YouTube Faces Database."), 17 | license = "MIT", 18 | keywords = "youtube faces database", 19 | url = "https://ai.informatik.tu-chemnitz.de/gogs/vitay/YouTubeFacesDB", 20 | packages=['YouTubeFacesDB'], 21 | long_description=read('README.md'), 22 | classifiers=[ 23 | "Development Status :: 3 - Alpha", 24 | "Topic :: Utilities", 25 | "License :: OSI Approved :: MIT License", 26 | ], 27 | ) --------------------------------------------------------------------------------