├── .gitignore
├── LICENSE
├── README.md
├── YouTubeFacesDB
│   ├── Dataset.py
│   ├── Generator.py
│   └── __init__.py
├── docs
│   ├── Makefile
│   └── source
│       ├── api.rst
│       ├── conf.py
│       ├── index.rst
│       └── tutorial.rst
├── examples
│   ├── GenerateSubset.py
│   ├── TrainKeras-Generator.py
│   └── TrainKeras.py
└── setup.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# ---> Python
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover

# Translations
*.mo
*.pot

# Django stuff:
*.log

# Sphinx documentation
docs/_build/

# PyBuilder
target/

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License
Copyright (c)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# YouTubeFacesDB

Python module for loading the YouTube Faces Database:

<http://www.cs.tau.ac.il/~wolf/ytfaces/>

**Description:** The data set contains 3,425 videos of 1,595 different people. All the videos were downloaded from YouTube. An average of 2.15 videos are available for each subject. The shortest clip duration is 48 frames, the longest clip is 6,070 frames, and the average length of a video clip is 181.3 frames.

**For TUC users:** the DB is already downloaded on cortex at `/work/biblio/youtube Faces DB` (with the spaces). Copy it to your machine (in `/scratch`, as it is over 25GB) and uncompress it.

**Author:** Julien Vitay <julien.vitay@informatik.tu-chemnitz.de>

**License:** MIT

## Installation

Apart from the usual Python (2.7) + numpy dependencies, the module requires:

* **Pillow** `pip install Pillow --user` for image processing.
* **h5py** `pip install h5py --user` to manage the HDF5 files. `libhdf5` should also be installed through your package manager.

The module can then be installed locally with:

~~~bash
python setup.py install --user
~~~

To build the documentation, you will need Sphinx `pip install Sphinx --user`. You can then go into the `docs/` directory and build it with:

~~~bash
make html
~~~

You can then access `docs/build/html/index.html` with your browser.

## Tutorial

### Transforming the YouTube Faces Database into an HDF5 file

An example is provided in `examples/GenerateSubset.py`. It accesses the dataset located at `/scratch/vitay/Datasets/YouTubeFaces` (`directory`), selects 10 random labels from it (`labels`), fetches all corresponding images (`max_number`), crops them so that they contain only the face area (`cropped`), converts them to grayscale (`color`), resizes them to (100, 100) (`size`), prepends a dummy dimension to obtain a final numpy array of shape (1, 100, 100) (`bw_first`) and dumps them to the HDF5 file `ytfdb.h5` (`filename`).

~~~python
from YouTubeFacesDB import generate_ytf_database
generate_ytf_database(
    directory='/scratch/vitay/Datasets/YouTubeFaces', # Location of the YTF dataset
    filename='ytfdb.h5', # Name of the HDF5 file to write to
    labels=10, # Number of labels to randomly select
    max_number=-1, # Maximum number of images to use
    size=(100, 100), # Size of the images
    color=False, # Black and white
    bw_first=True, # Final shape is (1, w, h)
    cropped=True # The original images are cropped to the faces
)
~~~

Check the doc of `generate_ytf_database` to see the other arguments to this function.

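For instance, here is a sketch that keeps the color channels and limits the number of images per person, using the documented `max_images_per_person` and `rgb_first` arguments (the label list and the output file name below are only placeholders):

~~~python
from YouTubeFacesDB import generate_ytf_database
generate_ytf_database(
    directory='/scratch/vitay/Datasets/YouTubeFaces',
    filename='ytfdb-color.h5',     # hypothetical output file
    labels=['Aaron_Eckhart'],      # an explicit list of labels instead of a count
    max_images_per_person=500,     # at most 500 images per person
    size=(100, 100),               # Size of the images
    color=True,                    # keep the RGB channels
    rgb_first=True,                # final shape is (3, w, h), e.g. for Theano
    cropped=True                   # crop to the face region
)
~~~
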
**Beware:** if you try to generate all color images of all labels with a size (100, 100), the process will take over half an hour and the HDF5 file will be over 50GB, so do not save it in your home directory.

### Loading the HDF5 file for usage in Python

Once the HDF5 file has been generated, you can use it in Python for learning. An example is provided in `examples/TrainKeras.py`, where a convolutional network written in Keras (`pip install Theano --user && pip install keras --user`) is trained on the data contained in `ytfdb.h5`.

#### Loading the dataset into memory

To load the data, you need to create a `YouTubeFacesDB` object, pass it the path to the HDF5 file and call the `get()` method:

~~~python
from YouTubeFacesDB import YouTubeFacesDB
db = YouTubeFacesDB('ytfdb.h5')
X, y = db.get()
~~~

`X` is a numpy array containing all input images. The first index corresponds to the image number, the remaining ones to the shape of the numpy array representing each image. This information can also be retrieved through the attributes of the object:

~~~python
N = db.nb_samples # number of samples, e.g. 10000
d = db.input_dim # shape of the images, e.g. (1, 100, 100)
~~~

`y` is a numpy array containing the label index for each image (in vectorized form, see *categorical outputs*). You can access the number of labels, as well as the list of labels, easily:

~~~python
C = db.nb_classes # Number of classes
labels = db.labels # List of strings for the labels
~~~

#### Transforming the data

**Mean removal**

`X` contains for each pixel a floating-point value between 0. and 1. (the conversion from integers [0..255] to floats [0..1] was done during the generation process). However, neural networks typically work much better when the input data has a zero mean. Fortunately, the mean input (i.e. the mean face) was also saved during the generation process. You can remove it from the input using:

~~~python
mean_face = db.mean
X -= mean_face
~~~

You can also tell the `YouTubeFacesDB` object to systematically remove this mean from the inputs:

~~~python
db = YouTubeFacesDB('ytfdb.h5', mean_removal=True)
X, y = db.get()
~~~

This way, `X` has a zero mean over the first axis, without needing to explicitly compute it. This is particularly useful when generating minibatches.

**Categorical outputs**

The output labels are originally integers between 0 and `db.nb_classes` - 1. To train neural networks, it is often required to represent the output as binary arrays of length `db.nb_classes`, where only one element is 1 and the rest are 0. For example, the third class among 10 would be represented by `0010000000`. This is the default representation returned by the `YouTubeFacesDB` object.

If you prefer to get the labels as integers in `y`, you can specify it in the constructor:

~~~python
db = YouTubeFacesDB('ytfdb.h5', output_type='integer')
~~~

The default value of `output_type` is `vector`.

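Since the default outputs are one-hot encoded, mapping a sample (or a network prediction) back to the person's name is a simple `argmax` over the vector, followed by a lookup in `db.labels`. A minimal sketch, reusing the `y` array obtained from `db.get()` above:

~~~python
import numpy as np
class_index = np.argmax(y[0]) # index of the active class for the first sample
name = db.labels[class_index] # corresponding label string
print(name)
~~~
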
#### Splitting the data into training, validation and test sets

`db.get()` returns by default the whole data. If you want to split this data into training, validation and test sets, you can call the method `split_dataset()`:

~~~python
db.split_dataset(validation_size=0.2, test_size=0.1)
~~~

In this example, the validation set will contain 20% of the samples and the test set 10%. The rest stays in the training set. The samples are randomly chosen from the data. To retrieve the corresponding data, provide an argument to `get()`:

~~~python
db.split_dataset(validation_size=0.2, test_size=0.1)
X_train, y_train = db.get('train')
X_val, y_val = db.get('val')
X_test, y_test = db.get('test')
~~~

By default, the validation set has 20% of the data and the test set 0%.

#### Generating minibatches

Loading the whole dataset into memory with `get()` defeats the purpose of storing a large-scale dataset in an HDF5 file. In practice, it is recommended to load only minibatches (of, say, 1000 samples) one at a time, process them, and ask for a new one.

The method `generate_batches()` returns a generator that loops over a dataset and yields the data `(X, y)` for each minibatch:

~~~python
for X, y in db.generate_batches(batch_size=100, dset='train', rest=True):
    do_something(X, y)
~~~

`batch_size` defines how many samples will be in each minibatch, `dset` from which dataset the samples will be taken (`['all', 'train', 'val', 'test']`) and `rest` what should be done with the last samples if the total number of samples is not a multiple of the batch size. For example, if the dataset has 1537 samples and the batch size is 100, the `for` loop will be executed 15 times. The remaining 37 samples will be returned only if `rest` is set to True (as smaller batches may cause problems with some tensor libraries).

Between two calls to `generate_batches()`, the indices are shuffled, so the minibatches will never be identical between epochs.

The example in `examples/TrainKeras-Generator.py` shows how to use minibatches with Keras. Note that the `fit_generator()` method of Keras does not work with this generator: Keras runs the generator in a separate thread, and h5py does not cope well with being accessed from several threads at once.

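#### Accessing the video indices

The HDF5 file also stores, for each sample, the index of the video it was extracted from, exposed as the `video` attribute of `YouTubeFacesDB` (an HDF5 dataset with one integer per sample). This index is relative to each person (every person's first clip has index 0), so combine it with the label to identify a unique clip. A minimal sketch, assuming the default `vector` outputs, for selecting all samples coming from the same clip as sample 0:

~~~python
import numpy as np
video = np.array(db.video)       # per-person video index for each sample
X, y = db.get()
label_idx = np.argmax(y, axis=1) # class index per sample (vector outputs)
same_clip = np.flatnonzero((video == video[0]) & (label_idx == label_idx[0]))
~~~
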
--------------------------------------------------------------------------------
/YouTubeFacesDB/Dataset.py:
--------------------------------------------------------------------------------
# Standard library
from __future__ import print_function, with_statement
from time import time
import re
import os
import copy
import random
import csv
# Dependencies
import numpy as np
import h5py
from PIL import Image


def to_categorical(y, nb_classes=None):
    """
    Convert class vector (integers from 0 to nb_classes - 1) to binary class matrix, for use with categorical_crossentropy.

    Taken from Keras.
    """
    if nb_classes is None: # infer the number of classes if not provided
        nb_classes = int(np.max(y)) + 1
    Y = np.zeros((len(y), nb_classes))
    for i in range(len(y)):
        Y[i, y[i]] = 1.
    return Y

class YouTubeFacesDB(object):
    """
    Class allowing to interact with a HDF5 file containing a subset of the YouTube Faces dataset.
    """
    def __init__(self, filename, mean_removal=False, output_type='vector'):
        """
        Parameters:

        * `filename`: path to the HDF5 file containing the data.
        * `mean_removal`: defines if the mean image should be subtracted from each image.
        * `output_type`: ['integer', 'vector'] defines the output for each sample. 'integer' will return the index of the class (e.g. 3), while 'vector' will return a vector with nb_classes components, all zero but one (e.g. 000...00100). Default: 'vector'.
        """
        # Open the file
        self.filename = filename
        try:
            self.f = h5py.File(self.filename, "r")
        except Exception:
            print('Error:', self.filename, 'does not exist.')
            raise

        # Data
        self._X = self.f.get('X')
        self._y = self.f.get('Y')

        # Mean input
        self.mean_removal = mean_removal
        self.mean = np.array(self.f.get('mean'))

        # Size
        shape = self._X.shape
        #: Total number of samples in the dataset
        self.nb_samples = shape[0]
        #: Shape of the inputs
        self.input_dim = shape[1:]

        # Indices
        self._indices = list(range(self.nb_samples))
        self._training_indices = self._indices
        self._validation_indices = []
        self._test_indices = []
        #: Number of samples in the training set
        self.nb_train = self.nb_samples
        #: Number of samples in the validation set
        self.nb_val = 0
        #: Number of samples in the test set
        self.nb_test = 0

        # Labels
        labels = self.f.get('labels')
        #: List of labels
        self.labels = []
        for label in labels:
            self.labels.append(str(label[0]))
        #: Total number of classes
        self.nb_classes = len(self.labels)
        if output_type not in ['integer', 'vector']:
            print("Error: output_type must be in ['integer', 'vector']")
            output_type = 'vector'
        #: Output type ['integer', 'vector']
        self.output_type = output_type

        #: Index of the video for each frame
        self.video = self.f.get('video')

    def split_dataset(self, validation_size=0.2, test_size=0.0):
        """
        Split the dataset into a training set, a validation set and optionally a test set.

        Parameters:

        * `validation_size`: proportion of the data in the validation set (default: 0.2)
        * `test_size`: proportion of the data in the test set (default: 0.0)

        The split is only internal to the object (the method returns nothing), as the actual data should be later read from disk.

        This method sets the following attributes:

        * `self.nb_train`: number of samples in the training set.
        * `self.nb_val`: number of samples in the validation set.
        * `self.nb_test`: number of samples in the test set.

        To actually get the data, you will have to call either::

            X, y = db.get('all')
            X_train, y_train = db.get('train')
            X_val, y_val = db.get('val')
            X_test, y_test = db.get('test')
        """
        # Number of examples
        self.nb_val = int(self.nb_samples*validation_size)
        self.nb_test = int(self.nb_samples*test_size)
        self.nb_train = self.nb_samples - self.nb_val - self.nb_test
        # Compute the indices
        indices = copy.deepcopy(self._indices)
        random.shuffle(indices)
        self._validation_indices = sorted(indices[:self.nb_val])
        if self.nb_test != 0:
            self._test_indices = sorted(indices[self.nb_val:self.nb_val+self.nb_test])
        else:
            self._test_indices = []
        self._training_indices = sorted(indices[self.nb_val+self.nb_test:])
        print('Training:', self.nb_train, '; Validation:', self.nb_val, '; Test:', self.nb_test, '; Total:', self.nb_samples)

    def get(self, dset='all'):
        """
        Returns the requested part of the dataset as a tuple (X, y) of numpy arrays.

        Parameters:

        * `dset`: string in ['train', 'val', 'test', 'all'] for the desired part of the dataset (default: 'all').
        """
        if dset == 'all':
            X = np.array(self._X)
            y = np.array(self._y, dtype='int32')
        elif dset == 'train':
            X = np.array(self._X[self._training_indices, ...])
            y = np.array(self._y[self._training_indices, ...], dtype='int32')
        elif dset == 'val':
            X = np.array(self._X[self._validation_indices, ...])
            y = np.array(self._y[self._validation_indices, ...], dtype='int32')
        elif dset == 'test':
            X = np.array(self._X[self._test_indices, ...])
            y = np.array(self._y[self._test_indices, ...], dtype='int32')
        else:
            print("Error: the `dset` argument to get() must be in ['train', 'val', 'test', 'all']")
            X = np.array([[]])
            y = np.array([], dtype='int32')

        return self._transform_data(X, y)

    def _transform_data(self, X, y):
        "Applies transformations to the data (mean removal, output type...)."
        # Mean removal
        if self.mean_removal:
            X -= self.mean

        # Categorical outputs
        if self.output_type == 'vector':
            y = to_categorical(y, self.nb_classes)

        return X, y

    def generate_batches(self, batch_size, dset='all', rest=True):
        """
        Generator yielding minibatches of random samples of the DB as (X, y) tuples, until the dataset has been fully seen.

        Parameters:

        * `batch_size`: number of samples per minibatch.
        * `dset`: string in ['train', 'val', 'test', 'all'] for the desired part of the dataset (default: 'all').
        * `rest`: defines if the remaining samples after the last full minibatch should be sent anyway (default: True)
        """
        # Access the dataset indices
        if dset == 'train':
            indices = copy.deepcopy(self._training_indices)
            N = self.nb_train
        elif dset == 'val':
            indices = copy.deepcopy(self._validation_indices)
            N = self.nb_val
        elif dset == 'test':
            indices = copy.deepcopy(self._test_indices)
            N = self.nb_test
        elif dset == 'all':
            indices = copy.deepcopy(self._indices)
            N = self.nb_samples
        else:
            print("Error: the `dset` argument to generate_batches() must be in ['train', 'val', 'test', 'all']")
            return

        # Compute the number of minibatches
        nb_batches = int(N/batch_size)
        rest_batches = N - nb_batches*batch_size # number of samples left over

        # Shuffle the training set
        random.shuffle(indices)

        # Iterate over the minibatches
        for b in range(nb_batches):
            samples = sorted(indices[b*batch_size:(b+1)*batch_size])
            X = np.array(self._X[samples, ...])
            y = np.array(self._y[samples, ...], dtype='int32')
            X, y = self._transform_data(X, y)
            yield X, y

        # Yield the remaining samples, if any. May be inefficient.
        if rest_batches != 0 and rest:
            samples = sorted(indices[nb_batches*batch_size:])
            X = np.array(self._X[samples, ...])
            y = np.array(self._y[samples, ...], dtype='int32')
            X, y = self._transform_data(X, y)
            yield X, y

--------------------------------------------------------------------------------
/YouTubeFacesDB/Generator.py:
--------------------------------------------------------------------------------
# Standard library
from __future__ import print_function, with_statement
from time import time
import re
import os
import random
import csv
# Dependencies
import numpy as np
import h5py
from PIL import Image

# Structure of the YTF directory
original_folder = '/frame_images_DB/'
aligned_folder = '/aligned_images_DB/'

def _get_labels(directory):
    "Retrieves the list of labels from the aligned directory"
    return sorted(os.listdir(directory + aligned_folder), key=lambda s: s.lower())

def _check_labels(labels, directory):
    "Compares the provided list of labels to ones which exist."
    orig = _get_labels(directory)
    for label in labels:
        if not label in orig:
            print('Error:', label, 'does not exist in the YouTube Faces database.')
            exit(1)


def _gather_images_info(directory, labels, max_images_per_person):
    "Iterates over all labels and gets the filenames and crop information"
    data = []
    for name in labels:
        # Each image is described in frame_images_DB/Aaron_Eckhart.labeled_faces.txt
        data_file = directory + original_folder + name + '.labeled_faces.txt'
        # Read the file
        data_person = []
        try:
            with open(data_file, 'r') as csvfile:
                for entry in csv.reader(csvfile, delimiter=','):
                    img_name = entry[0].replace('\\', '/')
                    center_w, center_h = int(entry[2]), int(entry[3])
                    size_w, size_h = int(entry[4]), int(entry[5])
                    data_person.append({
                        'name': name,
                        'filename': img_name,
                        'center': (center_w, center_h),
                        'size': (size_w, size_h)
                    })
        except Exception as e:
            print('Error: could not read', data_file)
            print(e)
            return data # abort and return what has been gathered so far
        # Possibly select a maximal number of them
        if max_images_per_person == -1: # everything
            data.extend(data_person)
        else:
            data.extend(random.sample(data_person, max_images_per_person))

    return data

def _create_db(directory, metadata, labels, filename, size, color, rgb_first, bw_first, cropped):
    "Main method to fetch all images into the hdf5 DB."
    # Total number of images
    nb_images = len(metadata)
    # Final size of the image
    if color and rgb_first:
        final_size = (3, ) # channel is first
    elif not color and bw_first: # add a dummy (1,) in front
        final_size = (1, )
    else:
        final_size = ()
    final_size += size
    if color and not rgb_first:
        final_size += (3,)
    print('Final size of the images:', final_size)
    # Initialize the hdf5 DB
    f = h5py.File(filename, "w")
    dset_X = f.create_dataset("X", (nb_images,) + final_size, dtype='f')
    dset_Y = f.create_dataset("Y", (nb_images,), dtype='i')
    dset_video = f.create_dataset("video", (nb_images,), dtype='i')
    # Save the list of labels as fixed-length ASCII strings
    max_length = 0
    for label in labels:
        max_length = max(max_length, len(label))
    asciiList = [n.encode("ascii", "ignore") for n in labels]
    f.create_dataset('labels', (len(labels), 1), 'S'+str(max_length), asciiList)
    # Compute the mean image
    mean_img = np.zeros(final_size)
    # Iterate over all images
    for idx in range(nb_images):
        # Retrieve the info
        description = metadata[idx] # description
        name = description['name'] # name of the person
        y = labels.index(name) # corresponding index between 0 and 1594
        img_filename = description['filename'] # complete filename of the image
        video_idx = int(re.findall(r'/([\d]+)/', img_filename)[0]) # index of the video
        center_w, center_h = description['center'] # center of the face
        size_w, size_h = description['size'] # size of the face
        # Get the image
        img_file_path = directory + original_folder + img_filename
        img = Image.open(img_file_path)
        # Crop the image to the face
        if cropped:
            img = img.crop((center_w - size_w/2, center_h - size_h/2, center_w + size_w/2, center_h + size_h/2))
        # Resize the image
        img = img.resize(size)
        # Color
        if not color:
            img = img.convert('L')
        # Get the numpy array
        img_data = np.array(img).astype('float32')/255.
        # Swap the axes (to have (3, w, h))
        if color and rgb_first:
            img_data = img_data.swapaxes(0, 2)
        # Add a dummy first axis to BW images for theano
        if not color and bw_first:
            img_data = img_data[np.newaxis, :, :]
        # Update the mean incrementally
        mean_img += (img_data - mean_img)/float(idx+1)
        # Push it to the HDF5 file
        dset_X[idx, ...] = img_data
        dset_Y[idx] = y
        dset_video[idx] = video_idx
    # Last, save the mean and close the file
    f.create_dataset('mean', (1, )+final_size, 'f', mean_img)
    f.close()

def generate_ytf_database(
        directory,
        filename,
        size,
        labels=None,
        max_number=-1,
        max_images_per_person=-1,
        color=True,
        rgb_first=True,
        bw_first=False,
        cropped=True):
    """
    Method to generate a subset of the YouTube Faces database in a HDF5 file.

    Arguments:

    * `directory`: directory where the YouTube Faces DB is located.
    * `filename`: path and name of the hdf5 file where the DB will be saved.
    * `size`: (width, height) size for the extracted images.
    * `labels`: number or list of labels which should be used (default: None, for all labels).
    * `max_number`: maximum number of images (default: -1, all of them).
    * `max_images_per_person`: maximum number of images which should be extracted per person (default: -1, all images)
    * `color`: if the color channels should be preserved (default: True)
    * `rgb_first`: if True, the numpy arrays of colored images will have the shape (3, w, h), otherwise (w, h, 3) (default: True). Useful for Theano backends.
    * `bw_first`: if True, the numpy arrays of black&white images will have the shape (1, w, h), otherwise (w, h) (default: False). Useful for Theano backends.
    * `cropped`: if the images should be cropped around the detected face (default: True)
    """
    tstart = time()
    # Get the labels
    if labels is None or labels == -1:
        print('Retrieving all labels...')
        labels = _get_labels(directory)
    elif isinstance(labels, int):
        print('Generating', labels, 'labels randomly...')
        nb_labels = labels
        orig = _get_labels(directory)
        if nb_labels >= len(orig):
            print('There are only', len(orig), 'labels in the database...')
            labels = orig
        else:
            labels = sorted(random.sample(orig, nb_labels), key=lambda s: s.lower())
        for label in labels:
            print('\t', label)
    else:
        print('Checking the labels...')
        _check_labels(labels, directory)

    # Retrieve the metadata on all images
    print('Gathering image locations...')
    metadata = _gather_images_info(directory, labels, max_images_per_person)
    nb_images = len(metadata)
    print('Found', nb_images, 'images for', len(labels), 'people.')

    # Reduce the number of images
    if max_number != -1 and max_number < nb_images:
        print('Reducing this number to', max_number)
        metadata = random.sample(metadata, max_number)

    # Get all the images, crop/resize them, and save them into a hdf5 file
    _create_db(directory, metadata, labels, filename, size, color, rgb_first, bw_first, cropped)
    print('Done in', time()-tstart, 'seconds.')

--------------------------------------------------------------------------------
/YouTubeFacesDB/__init__.py:
--------------------------------------------------------------------------------
from Generator import generate_ytf_database
from Dataset import YouTubeFacesDB

--------------------------------------------------------------------------------
/docs/Makefile:
--------------------------------------------------------------------------------
# Makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS    =
SPHINXBUILD   = sphinx-build
PAPER         =
BUILDDIR      = build

# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/)
endif

# Internal variables.
PAPEROPT_a4     = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS   = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS  = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source

.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext

help:
	@echo "Please use \`make <target>' where <target> is one of"
	@echo "  html       to make standalone HTML files"
	@echo "  dirhtml    to make HTML files named index.html in directories"
	@echo "  singlehtml to make a single large HTML file"
	@echo "  pickle     to make pickle files"
	@echo "  json       to make JSON files"
	@echo "  htmlhelp   to make HTML files and a HTML help project"
	@echo "  qthelp     to make HTML files and a qthelp project"
	@echo "  devhelp    to make HTML files and a Devhelp project"
	@echo "  epub       to make an epub"
	@echo "  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
	@echo "  latexpdf   to make LaTeX files and run them through pdflatex"
	@echo "  latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
	@echo "  text       to make text files"
	@echo "  man        to make manual pages"
	@echo "  texinfo    to make Texinfo files"
	@echo "  info       to make Texinfo files and run them through makeinfo"
	@echo "  gettext    to make PO message catalogs"
	@echo "  changes    to make an overview of all changed/added/deprecated items"
	@echo "  xml        to make Docutils-native XML files"
	@echo "  pseudoxml  to make pseudoxml-XML files for display purposes"
	@echo "  linkcheck  to check all external links for integrity"
	@echo "  doctest    to run all doctests embedded in the documentation (if enabled)"

clean:
	rm -rf $(BUILDDIR)/*

readme:
	pandoc ../README.md -o source/tutorial.rst

html:
	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."

dirhtml:
	$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."

singlehtml:
	$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
	@echo
	@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."

pickle:
	$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
	@echo
	@echo "Build finished; now you can process the pickle files."

json:
	$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
	@echo
	@echo "Build finished; now you can process the JSON files."

htmlhelp:
	$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
	@echo
	@echo "Build finished; now you can run HTML Help Workshop with the" \
	      ".hhp project file in $(BUILDDIR)/htmlhelp."

qthelp:
	$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
	@echo
	@echo "Build finished; now you can run "qcollectiongenerator" with the" \
	      ".qhcp project file in $(BUILDDIR)/qthelp, like this:"
	@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/YouTubeFacesDB.qhcp"
	@echo "To view the help file:"
	@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/YouTubeFacesDB.qhc"

devhelp:
	$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
	@echo
	@echo "Build finished."
	@echo "To view the help file:"
	@echo "# mkdir -p $$HOME/.local/share/devhelp/YouTubeFacesDB"
	@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/YouTubeFacesDB"
	@echo "# devhelp"

epub:
	$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
	@echo
	@echo "Build finished. The epub file is in $(BUILDDIR)/epub."

latex:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo
	@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
	@echo "Run \`make' in that directory to run these through (pdf)latex" \
	      "(use \`make latexpdf' here to do that automatically)."

latexpdf:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo "Running LaTeX files through pdflatex..."
	$(MAKE) -C $(BUILDDIR)/latex all-pdf
	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."

latexpdfja:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo "Running LaTeX files through platex and dvipdfmx..."
	$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."

text:
	$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
	@echo
	@echo "Build finished. The text files are in $(BUILDDIR)/text."

man:
	$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
	@echo
	@echo "Build finished. The manual pages are in $(BUILDDIR)/man."

texinfo:
	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
	@echo
	@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
	@echo "Run \`make' in that directory to run these through makeinfo" \
	      "(use \`make info' here to do that automatically)."

info:
	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
	@echo "Running Texinfo files through makeinfo..."
	make -C $(BUILDDIR)/texinfo info
	@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."

gettext:
	$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
	@echo
	@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."

changes:
	$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
	@echo
	@echo "The overview file is in $(BUILDDIR)/changes."

linkcheck:
	$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
	@echo
	@echo "Link check complete; look for any errors in the above output " \
	      "or in $(BUILDDIR)/linkcheck/output.txt."

doctest:
	$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
	@echo "Testing of doctests in the sources finished, look at the " \
	      "results in $(BUILDDIR)/doctest/output.txt."

xml:
	$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
	@echo
	@echo "Build finished. The XML files are in $(BUILDDIR)/xml."

pseudoxml:
	$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
	@echo
	@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."

--------------------------------------------------------------------------------
/docs/source/api.rst:
--------------------------------------------------------------------------------
Documentation
=============

Method ``generate_ytf_database``
--------------------------------

.. autofunction:: YouTubeFacesDB.generate_ytf_database

Class ``YouTubeFacesDB``
------------------------

.. autoclass:: YouTubeFacesDB.YouTubeFacesDB
    :members:

--------------------------------------------------------------------------------
/docs/source/conf.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
#
# YouTubeFacesDB documentation build configuration file, created by
# sphinx-quickstart on Wed Feb 10 18:14:23 2016.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.

import sys
import os

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath('..'))

# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.mathjax',
    'sphinx.ext.viewcode',
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# The suffix of source filenames.
source_suffix = '.rst'

# The encoding of source files.
#source_encoding = 'utf-8-sig'

# The master toctree document.
master_doc = 'index'

# General information about the project.
project = u'YouTubeFacesDB'
copyright = u'2016, Julien Vitay'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '0.0.1'
# The full version, including alpha/beta/rc tags.
release = '0.0.1'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#language = None

# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = []

# The reST default role (used for this markup: `text`) to use for all
# documents.
#default_role = None

# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True

# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True

# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'

# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []

# If true, keep warnings as "system message" paragraphs in the built documents.
#keep_warnings = False


# -- Options for HTML output ----------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = 'haiku'

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}

# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []

# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
#html_title = None

# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None

# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
#html_logo = None

# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
#html_extra_path = []

# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
#html_last_updated_fmt = '%b %d, %Y'

# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
#html_use_smartypants = True

# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}

# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}

# If false, no module index is generated.
#html_domain_indices = True

# If false, no index is generated.
#html_use_index = True

# If true, the index is split into individual pages for each letter.
#html_split_index = False

# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True

# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
#html_show_sphinx = True

# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
#html_show_copyright = True

# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''

# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None

# Output file base name for HTML help builder.
htmlhelp_basename = 'YouTubeFacesDBdoc'


# -- Options for LaTeX output ---------------------------------------------

latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#'papersize': 'letterpaper',

# The font size ('10pt', '11pt' or '12pt').
#'pointsize': '10pt',

# Additional stuff for the LaTeX preamble.
#'preamble': '',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
#  author, documentclass [howto, manual, or own class]).
latex_documents = [
  ('index', 'YouTubeFacesDB.tex', u'YouTubeFacesDB Documentation',
   u'Julien Vitay', 'manual'),
]

# The name of an image file (relative to this directory) to place at the top of
# the title page.
#latex_logo = None

# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False

# If true, show page references after internal links.
#latex_show_pagerefs = False

# If true, show URL addresses after external links.
#latex_show_urls = False

# Documents to append as an appendix to all manuals.
#latex_appendices = []

# If false, no module index is generated.
#latex_domain_indices = True


# -- Options for manual page output ---------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    ('index', 'youtubefacesdb', u'YouTubeFacesDB Documentation',
     [u'Julien Vitay'], 1)
]

# If true, show URL addresses after external links.
#man_show_urls = False


# -- Options for Texinfo output -------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
#  dir menu entry, description, category)
texinfo_documents = [
  ('index', 'YouTubeFacesDB', u'YouTubeFacesDB Documentation',
   u'Julien Vitay', 'YouTubeFacesDB', 'Python scripts to load the YouTube Faces Database.',
   'Miscellaneous'),
]

# Documents to append as an appendix to all manuals.
#texinfo_appendices = []

# If false, no module index is generated.
#texinfo_domain_indices = True

# How to display URL addresses: 'footnote', 'no', or 'inline'.
#texinfo_show_urls = 'footnote'

# If true, do not generate a @detailmenu in the "Top" node's menu.
#texinfo_no_detailmenu = False

#########################################
# autodoc parameters
#########################################

autodoc_member_order = 'groupwise'
autoclass_content = 'both'

--------------------------------------------------------------------------------
/docs/source/index.rst:
--------------------------------------------------------------------------------
YouTubeFacesDB
==============

Python module for loading the YouTube Faces Database:

http://www.cs.tau.ac.il/~wolf/ytfaces/

**Description:** The data set contains 3,425 videos of 1,595 different
people. All the videos were downloaded from YouTube. An average of 2.15
videos are available for each subject. The shortest clip duration is 48
frames, the longest clip is 6,070 frames, and the average length of a
video clip is 181.3 frames.

**For TUC users:** the DB is already downloaded on cortex at
``/work/biblio/youtube Faces DB`` (with the spaces). Copy it to your
machine (in ``/scratch``, as it is over 25GB) and uncompress it.

**Author:** Julien Vitay julien.vitay@informatik.tu-chemnitz.de

**License:** MIT

.. toctree::
    :maxdepth: 4

    tutorial
    api

--------------------------------------------------------------------------------
/docs/source/tutorial.rst:
--------------------------------------------------------------------------------
YouTubeFacesDB
==============

Python module for loading the YouTube Faces Database:

http://www.cs.tau.ac.il/~wolf/ytfaces/

**Description:** The data set contains 3,425 videos of 1,595 different
people. All the videos were downloaded from YouTube. An average of 2.15
videos are available for each subject. The shortest clip duration is 48
frames, the longest clip is 6,070 frames, and the average length of a
video clip is 181.3 frames.

**For TUC users:** the DB is already downloaded on cortex at
``/work/biblio/youtube Faces DB`` (with the spaces). Copy it to your
machine (in ``/scratch``, as it is over 25GB) and uncompress it.

**Author:** Julien Vitay julien.vitay@informatik.tu-chemnitz.de

**License:** MIT

Installation
------------

Apart from the usual Python (2.7) + numpy dependencies, the module
requires:

- **Pillow** ``pip install Pillow --user`` for image processing.
- **h5py** ``pip install h5py --user`` to manage the HDF5 files.
  ``libhdf5`` should also be installed through your package manager.

The module can then be installed locally with:

.. code:: bash

    python setup.py install --user

To build the documentation, you will need Sphinx
``pip install Sphinx --user``. You can then go into the ``docs/``
directory and build it with:

.. code:: bash

    make html

You can then access ``docs/build/html/index.html`` with your browser.

Tutorial
--------

Transforming the YouTube Faces Database into an HDF5 file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

An example is provided in ``examples/GenerateSubset.py``. It accesses
the dataset located at ``/scratch/vitay/Datasets/YouTubeFaces``
(``directory``), selects 10 random labels from it (``labels``), fetches
all corresponding images (``max_number``), crops them so that they
contain only the face area (``cropped``), converts them to grayscale
(``color``), resizes them to (100, 100) (``size``), prepends a dummy
dimension to obtain a final numpy array of shape (1, 100, 100)
(``bw_first``) and dumps them to the HDF5 file ``ytfdb.h5``
(``filename``).

.. code:: python

    from YouTubeFacesDB import generate_ytf_database
    generate_ytf_database(
        directory='/scratch/vitay/Datasets/YouTubeFaces', # Location of the YTF dataset
        filename='ytfdb.h5', # Name of the HDF5 file to write to
        labels=10, # Number of labels to randomly select
        max_number=-1, # Maximum number of images to use
        size=(100, 100), # Size of the images
        color=False, # Black and white
        bw_first=True, # Final shape is (1, w, h)
        cropped=True # The original images are cropped to the faces
    )

Check the doc of ``generate_ytf_database`` to see the other arguments
to this function.

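For instance, here is a sketch that keeps the color channels and limits
the number of images per person, using the documented
``max_images_per_person`` and ``rgb_first`` arguments (the label list
and the output file name below are only placeholders):

.. code:: python

    from YouTubeFacesDB import generate_ytf_database
    generate_ytf_database(
        directory='/scratch/vitay/Datasets/YouTubeFaces',
        filename='ytfdb-color.h5',     # hypothetical output file
        labels=['Aaron_Eckhart'],      # an explicit list of labels instead of a count
        max_images_per_person=500,     # at most 500 images per person
        size=(100, 100),               # Size of the images
        color=True,                    # keep the RGB channels
        rgb_first=True,                # final shape is (3, w, h), e.g. for Theano
        cropped=True                   # crop to the face region
    )
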
**Beware:** if you try to generate all color images of all labels with a
size (100, 100), the process will take over half an hour and the HDF5
file will be over 50GB, so do not save it in your home directory.

Loading the HDF5 file for usage in Python
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once the HDF5 file has been generated, you can use it in Python for
learning. An example is provided in ``examples/TrainKeras.py``, where a
convolutional network written in Keras
(``pip install Theano --user && pip install keras --user``) is trained
on the data contained in ``ytfdb.h5``.

Loading the dataset into memory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To load the data, you need to create a ``YouTubeFacesDB`` object, pass
it the path to the HDF5 file and call the ``get()`` method:

.. code:: python

    from YouTubeFacesDB import YouTubeFacesDB
    db = YouTubeFacesDB('ytfdb.h5')
    X, y = db.get()

``X`` is a numpy array containing all input images. The first index
corresponds to the image number, the remaining ones to the shape of the
numpy array representing each image. This information can also be
retrieved through the attributes of the object:

.. code:: python

    N = db.nb_samples # number of samples, e.g. 10000
    d = db.input_dim # shape of the images, e.g. (1, 100, 100)

``y`` is a numpy array containing the label index for each image (in
vectorized form, see *categorical outputs*). You can access the number
of labels, as well as the list of labels, easily:

.. code:: python

    C = db.nb_classes # Number of classes
    labels = db.labels # List of strings for the labels

Transforming the data
^^^^^^^^^^^^^^^^^^^^^

**Mean removal**

``X`` contains for each pixel a floating-point value between 0. and 1.
(the conversion from integers [0..255] to floats [0..1] was done during
the generation process). However, neural networks typically work much
better when the input data has a zero mean. Fortunately, the mean input
(i.e. the mean face) was also saved during the generation process. You
can remove it from the input using:

.. code:: python

    mean_face = db.mean
    X -= mean_face

You can also tell the ``YouTubeFacesDB`` object to systematically
remove this mean from the inputs:

.. code:: python

    db = YouTubeFacesDB('ytfdb.h5', mean_removal=True)
    X, y = db.get()

This way, ``X`` has a zero mean over the first axis, without needing to
explicitly compute it. This is particularly useful when generating
minibatches.

**Categorical outputs**

The output labels are originally integers between 0 and
``db.nb_classes`` - 1. To train neural networks, it is often required
to represent the output as binary arrays of length ``db.nb_classes``,
where only one element is 1 and the rest are 0. For example, the third
class among 10 would be represented by ``0010000000``. This is the
default representation returned by the ``YouTubeFacesDB`` object.

If you prefer to get the labels as integers in ``y``, you can specify
it in the constructor:

.. code:: python

    db = YouTubeFacesDB('ytfdb.h5', output_type='integer')

The default value of ``output_type`` is ``vector``.

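Since the default outputs are one-hot encoded, mapping a sample (or a
network prediction) back to the person's name is a simple ``argmax``
over the vector, followed by a lookup in ``db.labels``. A minimal
sketch, reusing the ``y`` array obtained from ``db.get()`` above:

.. code:: python

    import numpy as np
    class_index = np.argmax(y[0]) # index of the active class for the first sample
    name = db.labels[class_index] # corresponding label string
    print(name)
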
Splitting the data into training, validation and test sets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``db.get()`` returns by default the whole data. If you want to split
this data into training, validation and test sets, you can call the
method ``split_dataset()``:

.. code:: python

    db.split_dataset(validation_size=0.2, test_size=0.1)

In this example, the validation set will contain 20% of the samples and
the test set 10%. The rest stays in the training set. The samples are
randomly chosen from the data. To retrieve the corresponding data,
provide an argument to ``get()``:

.. code:: python

    db.split_dataset(validation_size=0.2, test_size=0.1)
    X_train, y_train = db.get('train')
    X_val, y_val = db.get('val')
    X_test, y_test = db.get('test')

By default, the validation set has 20% of the data and the test set 0%.

Generating minibatches
^^^^^^^^^^^^^^^^^^^^^^

Loading the whole dataset into memory with ``get()`` defeats the
purpose of storing a large-scale dataset in an HDF5 file. In practice,
it is recommended to load only minibatches (of, say, 1000 samples) one
at a time, process them, and ask for a new one.

The method ``generate_batches()`` returns a generator that loops over a
dataset and yields the data ``(X, y)`` for each minibatch:

.. code:: python

    for X, y in db.generate_batches(batch_size=100, dset='train', rest=True):
        do_something(X, y)

``batch_size`` defines how many samples will be in each minibatch,
``dset`` from which dataset the samples will be taken
(``['all', 'train', 'val', 'test']``) and ``rest`` what should be done
with the last samples if the total number of samples is not a multiple
of the batch size. For example, if the dataset has 1537 samples and the
batch size is 100, the ``for`` loop will be executed 15 times. The
remaining 37 samples will be returned only if ``rest`` is set to True
(as smaller batches may cause problems with some tensor libraries).

Between two calls to ``generate_batches()``, the indices are shuffled,
so the minibatches will never be identical between epochs.

The example in ``examples/TrainKeras-Generator.py`` shows how to use
minibatches with Keras. Note that the ``fit_generator()`` method of
Keras does not work with this generator: Keras runs the generator in a
separate thread, and h5py does not cope well with being accessed from
several threads at once.

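Accessing the video indices
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The HDF5 file also stores, for each sample, the index of the video it
was extracted from, exposed as the ``video`` attribute of
``YouTubeFacesDB`` (an HDF5 dataset with one integer per sample). This
index is relative to each person (every person's first clip has index
0), so combine it with the label to identify a unique clip. A minimal
sketch, assuming the default ``vector`` outputs, for selecting all
samples coming from the same clip as sample 0:

.. code:: python

    import numpy as np
    video = np.array(db.video)       # per-person video index for each sample
    X, y = db.get()
    label_idx = np.argmax(y, axis=1) # class index per sample (vector outputs)
    same_clip = np.flatnonzero((video == video[0]) & (label_idx == label_idx[0]))
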
model.add(Dense(C)) 52 | model.add(Activation('softmax')) 53 | 54 | # Learning rule 55 | optimizer = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True) 56 | model.compile(loss='categorical_crossentropy', optimizer=optimizer) 57 | 58 | # Training 59 | print('Start training...') 60 | nb_epochs = 10 61 | batch_size = 100 62 | try: 63 | for epoch in range(nb_epochs): 64 | print('Epoch', epoch+1, '/', nb_epochs) 65 | tstart = time() 66 | # Training 67 | batch_generator = db.generate_batches(batch_size, dset='train') 68 | nb_train_batches = 0; train_loss = 0.0; train_accuracy = 0.0 69 | for X, y in batch_generator: 70 | loss, accuracy = model.train_on_batch(X, y, accuracy=True) 71 | train_loss += loss 72 | train_accuracy += accuracy 73 | nb_train_batches += 1 74 | # Validation 75 | batch_generator = db.generate_batches(batch_size, dset='val') 76 | nb_val_batches = 0; val_loss = 0.0; val_accuracy = 0.0 77 | for X, y in batch_generator: 78 | loss, accuracy = model.test_on_batch(X, y, accuracy=True) 79 | val_loss += loss 80 | val_accuracy += accuracy 81 | nb_val_batches += 1 82 | # Verbose 83 | print('\tTraining loss:', train_loss/float(nb_train_batches), 'accuracy:', train_accuracy/float(nb_train_batches)) 84 | print('\tValidation loss:', val_loss/float(nb_val_batches), 'accuracy:', val_accuracy/float(nb_val_batches)) 85 | print('\tTook', time()-tstart) 86 | 87 | except KeyboardInterrupt: 88 | pass 89 | 90 | # Final evaluation on the validation set (no test split is created in this script) 91 | print('Training finished.') 92 | batch_generator = db.generate_batches(batch_size, dset='val') 93 | nb_val_batches = 0; val_loss = 0.0; val_accuracy = 0.0 94 | for X, y in batch_generator: 95 | loss, accuracy = model.test_on_batch(X, y, accuracy=True) 96 | val_loss += loss 97 | val_accuracy += accuracy 98 | nb_val_batches += 1 99 | print('Validation loss:', val_loss/float(nb_val_batches), 'accuracy:', val_accuracy/float(nb_val_batches)) -------------------------------------------------------------------------------- /examples/TrainKeras.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, with_statement 2 | from time import time 3 | 4 | from YouTubeFacesDB import YouTubeFacesDB 5 | 6 | from keras.models import Sequential 7 | from keras.layers.core import Dense, Dropout, Activation, Flatten 8 | from keras.layers.convolutional import Convolution2D, MaxPooling2D 9 | from keras.optimizers import SGD, Adam, RMSprop, Adagrad, Adadelta 10 | from keras.regularizers import l2, activity_l2 11 | from keras.utils import np_utils 12 | 13 | 14 | ############################################################################### 15 | # Load the data from disk 16 | ############################################################################### 17 | tstart = time() 18 | 19 | db = YouTubeFacesDB('ytfdb.h5', mean_removal=True, output_type='vector') 20 | N = db.nb_samples 21 | d = db.input_dim 22 | C = db.nb_classes 23 | 24 | print(N, 'images of size', d, 'loaded in', time()-tstart) 25 | 26 | ############################################################################### 27 | # Split into a training set and a test set 28 | ############################################################################### 29 | db.split_dataset(validation_size=0.25) 30 | X_train, y_train = db.get('train') 31 | X_test, y_test = db.get('val') 32 | 33 | ############################################################################### 34 | # Train a not very deep network
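# (Same architecture as TrainKeras-Generator.py, but trained with model.fit() on arrays held fully in memory.)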
35 | ############################################################################### 36 | print('Create the network...') 37 | model = Sequential() 38 | 39 | # Convolutional input layer with maxpooling and dropout 40 | model.add(Convolution2D(16, 6, 6, border_mode='valid', input_shape=d)) 41 | model.add(Activation('relu')) 42 | model.add(MaxPooling2D(pool_size=(2, 2))) 43 | model.add(Dropout(0.5)) 44 | 45 | # Fully connected with ReLU and dropout 46 | model.add(Flatten()) 47 | model.add(Dense(100)) 48 | model.add(Activation('relu')) 49 | model.add(Dropout(0.5)) 50 | 51 | # Softmax output layer 52 | model.add(Dense(C)) 53 | model.add(Activation('softmax')) 54 | 55 | # Learning rule 56 | optimizer = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True) 57 | model.compile(loss='categorical_crossentropy', optimizer=optimizer) 58 | 59 | # Training 60 | print('Start training...') 61 | try: 62 | model.fit(X_train, y_train, 63 | batch_size=100, nb_epoch=10, 64 | show_accuracy=True, verbose=2, 65 | validation_data=(X_test, y_test)) 66 | except KeyboardInterrupt: 67 | pass 68 | 69 | # Evaluate on the validation set 70 | score = model.evaluate(X_test, y_test, 71 | show_accuracy=True, verbose=2) 72 | 73 | print('Training finished.') 74 | print('Validation score:', score[0]) 75 | print('Validation accuracy:', score[1]) -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import os 2 | from setuptools import setup 3 | 4 | # Utility function to read the README file. 5 | # Used for the long_description. It's nice, because now 1) we have a top level 6 | # README file and 2) it's easier to type in the README file than to put a raw 7 | # string in below ... 8 | def read(fname): 9 | return open(os.path.join(os.path.dirname(__file__), fname)).read() 10 | 11 | setup( 12 | name = "YouTubeFacesDB", 13 | version = "0.0.1", 14 | author = "Julien Vitay", 15 | author_email = "julien.vitay@gmail.com", 16 | description = ("Python scripts to load the YouTube Faces Database."), 17 | license = "MIT", 18 | keywords = "youtube faces database", 19 | url = "https://ai.informatik.tu-chemnitz.de/gogs/vitay/YouTubeFacesDB", 20 | packages=['YouTubeFacesDB'], 21 | long_description=read('README.md'), 22 | classifiers=[ 23 | "Development Status :: 3 - Alpha", 24 | "Topic :: Utilities", 25 | "License :: OSI Approved :: MIT License", 26 | ], 27 | ) --------------------------------------------------------------------------------