YouTube Faces


Overview:

Welcome to the YouTube Faces Database, a database of face videos designed for studying the problem of unconstrained face recognition in videos.
The data set contains 3,425 videos of 1,595 different people. All the videos were downloaded from YouTube. An average of 2.15 videos are available for each subject. The shortest clip duration is 48 frames, the longest clip is 6,070 frames, and the average length of a video clip is 181.3 frames.

Number of videos per person:
#videos    1    2    3    4   5  6
#people  591  471  307  167  51  8


In designing our video data set and benchmarks we follow the example of the 'Labeled Faces in the Wild' (LFW) image collection. Specifically, our goal is to produce a large-scale collection of videos along with labels indicating the identity of the person appearing in each video.
In addition, we publish benchmark tests, intended to measure the performance of video pair-matching techniques on these videos.
Finally, we provide descriptor encodings for the faces appearing in these videos, using well established descriptor methods.

Errata:

We were recently sent a list of errors that occurred during the labeling process. This information is provided as two files: YTFErrors.csv and splits_corrected.txt. We will later publish recommendations for reporting results obtained on the corrected data set.

Download the database:

Please provide the details below. We ask for them so that we can keep in touch in case a critical update is needed, or in case we organize a dedicated workshop, etc. When done, you will be able to see the FTP password. Thanks for your cooperation.


NEW OPTION: direct your browser to the web frontend at http://www.cslab.openu.ac.il/agas/
EVEN NEWER OPTION (*recommended*): direct your browser to the new web frontend at http://www.cslab.openu.ac.il/download/

The username and password are provided by the above form. The entire database is available as a single tar file, YouTubeFaces.tar.gz (25GB, md5sum: b41287d713703c542a39255393e1bf93), which contains all of the following:

1. WolfHassnerMaoz_CVPR11.pdf (389KB, md5sum: 9299b18ae2308b0ae692d89ab280681e):
The paper explaining this work, how the database was created, the benchmarks, the Matched Background Similarity (MBGS) method, etc.

2. frame_images_DB.tar.gz (12GB, md5sum: 5859fa8bc72c61bbb7f1444b1a18ab40):
Contains the videos downloaded from YouTube, broken into frames.
The directory structure is:
subject_name\video_number\video_number.frame.jpg

For each person in the database there is a file called subject_name.labeled_faces.txt
The data in this file is in the following format:
filename,[ignore],x,y,width,height,[ignore],[ignore]
where:
x,y are the coordinates of the center of the face, and width,height are the dimensions of the rectangle enclosing the face.
For example:
$ head -3 Richard_Gere.labeled_faces.txt
Richard_Gere\3\3.618.jpg,0,262,165,132,132,0.0,1
Richard_Gere\3\3.619.jpg,0,260,164,131,131,0.0,1
Richard_Gere\3\3.620.jpg,0,261,165,129,129,0.0,1
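
As a rough illustration, here is a minimal sketch (assuming Python 3 with Pillow, and the archive unpacked under frame_images_DB/) of parsing one line of a labeled_faces.txt file and cropping the face from its frame; the center-based box interpretation follows the format described above.

from PIL import Image

def parse_label_line(line):
    # Format: filename,[ignore],x,y,width,height,[ignore],[ignore]
    fields = line.strip().split(",")
    filename = fields[0].replace("\\", "/")  # annotations use backslash path separators
    x, y = int(fields[2]), int(fields[3])    # center of the face
    w, h = int(fields[4]), int(fields[5])    # size of the face rectangle
    return filename, (x, y, w, h)

def crop_face(frame_dir, line):
    filename, (x, y, w, h) = parse_label_line(line)
    img = Image.open(frame_dir + "/" + filename)
    left, top = x - w // 2, y - h // 2       # convert center-based box to corners
    return img.crop((left, top, left + w, top + h))

# Example with the first Richard_Gere line shown above:
face = crop_face("frame_images_DB", "Richard_Gere\\3\\3.618.jpg,0,262,165,132,132,0.0,1")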

3. aligned_images_DB.tar.gz (5.4GB, md5sum: 915accad71cd59c8d14399686e0c91f9):
Similar to frame_images_DB, this contains the videos downloaded from YouTube broken into frames, but after two processing steps:
a. face detection, expanding the bounding box to 2.2 times its original size, and cropping it from the frame.
b. alignment.
The directory structure is:
subject_name\video_number\aligned_detect_video_number.frame.jpg

4. descriptors_DB.tar.gz (7.9GB, md5sum: cb25c4099f14c10e6c88a4bc1e6bb1a9):
Contains mat files with the descriptors of the frames.
The directory structure is:
subject_name\mat files

For each video there are two files:
aligned_video_1.mat
video_1.mat

Each file contains descriptors per frame, with several descriptor types per frame.
One contains descriptors of the aligned version of the faces in the frame and the other contains descriptors of the non-aligned version.
Each of these files holds a struct with the following fields (for example, for a video with 80 frames):

VID_DESCS_FPLBP: [560x80 double]
VID_DESCS_LBP: [1770x80 double]
VID_DESCS_CSLBP: [480x80 double]
VID_DESCS_FILENAMES: {1x80 cell}

These are the different descriptors and the file names.
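
As a rough illustration, here is a minimal sketch (assuming Python 3 with SciPy) of reading one descriptor file; the field names follow the struct listed above, but the path is hypothetical. Depending on how the file was saved, the arrays may appear as top-level variables (as assumed here) or nested inside a struct.

from scipy.io import loadmat

mat = loadmat("descriptors_DB/Richard_Gere/aligned_video_3.mat")  # hypothetical path
lbp   = mat["VID_DESCS_LBP"]        # (1770, n_frames): one LBP descriptor per frame
fplbp = mat["VID_DESCS_FPLBP"]      # (560, n_frames)
cslbp = mat["VID_DESCS_CSLBP"]      # (480, n_frames)
names = mat["VID_DESCS_FILENAMES"]  # (1, n_frames) cell array of frame file names
print(lbp.shape, names[0, 0])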

5. meta_data.tar.gz (132MB, md5sum: ):
Contains the meta_and_splits.mat file, which provides an easy way to access the mat files in the descriptors DB. Splits is a data structure dividing the data set into 10 independent splits.
Each triplet in Splits has the form (index1, index2, is_same_person), where index1 and index2 are indices into the mat_names structure. Altogether there are 5,000 pairs divided equally into 10 independent splits: 2,500 'same' pairs and 2,500 'not-same' pairs.

video_labels: [1x3425 double]
video_names: {3425x1 cell}
mat_names: {3425x1 cell}
Splits: [500x3x10 double]
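
As a rough illustration, a minimal sketch (assuming Python 3 with SciPy) of walking the benchmark triplets; it assumes the indices are 1-based MATLAB indices (see the change history below regarding index conventions).

from scipy.io import loadmat

meta = loadmat("meta_data/meta_and_splits.mat")
mat_names = meta["mat_names"]  # {3425x1 cell}: descriptor file name per video
splits = meta["Splits"]        # [500x3x10]: (index1, index2, is_same_person)

for s in range(splits.shape[2]):               # the 10 independent splits
    for idx1, idx2, is_same in splits[:, :, s]:
        name1 = mat_names[int(idx1) - 1, 0]    # subtract 1 for 0-based Python indexing
        name2 = mat_names[int(idx2) - 1, 0]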

6. headpose_DB.tar.gz (5.8MB, md5sum: 05efac9e4ae1a9ed28c2ed09bdbdd137):
Contains mat files with the three rotation angles of the head for each frame in the data set.
The directory structure is:
headorient_apirun_subject_name_video_number.mat

Each mat file contains a struct with the following (for example, a video with 60 frames):
headpose: [3x60 double]
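
For example, a minimal sketch (assuming Python 3 with SciPy; the file name is hypothetical, and the ordering of the three angles is not specified here):

from scipy.io import loadmat

hp = loadmat("headpose_DB/headorient_apirun_Richard_Gere_3.mat")  # hypothetical file
angles = hp["headpose"]  # (3, n_frames): one rotation-angle triple per frame column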

In addition, we provide sources.tar.gz (5.6MB, md5sum: a965855464902af63dd44f16471025ab) for running the benchmark tests, including implementations of all methods. See the sources' README.

Database Creation:

Collection setup:
We began by using the 5,749 names of subjects included in the LFW data set to search YouTube for videos of these same individuals. The top six results for each query were downloaded.
We minimized the number of duplicate videos by considering two videos whose names have an edit distance of less than 3 to be duplicates.
Downloaded videos were then split into frames at 24fps. Faces were detected in these videos using the Viola-Jones face detector. Automatic screening was performed to eliminate detections lasting fewer than 48 consecutive frames, where detections were considered consecutive if the Euclidean distance between their detected centers was less than 10 pixels. This process ensures that the videos contain stable detections and are long enough to provide useful information to the various recognition algorithms.
Finally, the remaining videos were manually verified to ensure that (a) the videos are correctly labeled by subject, (b) are not semi-static, still-image slide-shows, and (c) no identical videos are included in the database.
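
As a rough illustration of the automatic screening step described above, here is a minimal sketch (assuming Python 3.8+, with one detected face center per frame):

import math

def stable_runs(centers, min_len=48, max_dist=10.0):
    # centers: list of (x, y) detection centers, one per frame.
    # Returns half-open frame ranges [start, end) of stable detections.
    runs, start = [], 0
    for i in range(1, len(centers) + 1):
        if i == len(centers) or math.dist(centers[i - 1], centers[i]) >= max_dist:
            if i - start >= min_len:
                runs.append((start, i))
            start = i
    return runs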

Database encodings:
All video frames are encoded using several well-established face-image descriptors. Specifically, we consider the face detector output in each frame.
The bounding box around the face is expanded to 2.2 times its original size and cropped from the frame.
The result is then resized to standard dimensions of 200x200 pixels.
We then crop the image again, leaving 100x100 pixels centered on the face. Following conversion to grayscale, the images are aligned by fixing the coordinates of automatically detected facial feature points, and we apply the following descriptors: Local Binary Patterns (LBP), Center-Symmetric LBP (CSLBP) and Four-Patch LBP (FPLBP). The image is divided into a fixed grid of blocks, and the descriptors of each block are normalized to unit Euclidean length.
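
As a rough illustration of the per-block encoding, here is a minimal sketch (assuming Python 3 with NumPy and scikit-image, an 8-bit 100x100 grayscale input, and an illustrative 5x5 grid; the grid size and LBP parameters are assumptions, not the values used to build the database):

import numpy as np
from skimage.feature import local_binary_pattern

def block_lbp_descriptor(gray, grid=5, n_bins=59):
    # Uniform LBP with 8 neighbors at radius 1 yields 59 pattern codes.
    lbp = local_binary_pattern(gray, P=8, R=1, method="nri_uniform")
    step = gray.shape[0] // grid
    blocks = []
    for by in range(grid):
        for bx in range(grid):
            patch = lbp[by * step:(by + 1) * step, bx * step:(bx + 1) * step]
            hist, _ = np.histogram(patch, bins=n_bins, range=(0, n_bins))
            norm = np.linalg.norm(hist)
            blocks.append(hist / norm if norm > 0 else hist)  # unit Euclidean length
    return np.concatenate(blocks)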

Benchmark tests:
Following the example of the LFW benchmark, we provide standard ten-fold cross-validation, pair-matching ('same'/'not-same') tests.
Specifically, we randomly collected 5,000 video pairs from the database, half of which are pairs of videos of the same person and half of different people. These pairs were divided into 10 splits.
Each split contains 250 'same' and 250 'not-same' pairs. Pairs were divided so that the splits remain subject-mutually exclusive: if videos of a subject appear in one split, no videos of that subject appear in any other split.
The goal of this benchmark is to determine, for each split, which are the same and which are the non-same pairs, by training on the pairs from the nine remaining splits. We note that this split design encourages classification techniques to learn what makes faces similar or different, rather than learn the appearance properties of particular individuals.
One may consider two test protocols. First, the restricted protocol limits the information available for training to the same/not-same labels in the training splits. Currently there are results only for the restricted protocol.
The unrestricted protocol, on the other hand, allows training methods access to subject identity labels, which has been shown in the past to improve recognition results on the LFW benchmark.
View the splits here: splits.txt.
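
As a rough illustration of the restricted protocol, a minimal sketch (assuming Python 3 with NumPy) of the ten-fold evaluation loop; score_pair is a hypothetical placeholder for any video-pair similarity measure:

import numpy as np

def evaluate(splits, score_pair):
    # splits: array of shape (500, 3, 10) holding (index1, index2, is_same) triplets.
    accuracies = []
    for test in range(splits.shape[2]):
        train_pairs = np.concatenate(
            [splits[:, :, s] for s in range(splits.shape[2]) if s != test])
        scores = np.array([score_pair(i, j) for i, j, _ in train_pairs])
        labels = train_pairs[:, 2]
        # Choose the threshold that maximizes accuracy on the nine training splits.
        thr = max(set(scores), key=lambda t: np.mean((scores >= t) == labels))
        test_scores = np.array([score_pair(i, j) for i, j, _ in splits[:, :, test]])
        accuracies.append(np.mean((test_scores >= thr) == splits[:, 2, test]))
    return float(np.mean(accuracies))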

Reference:

If you use this database, or refer to its results, please cite the following paper:

Lior Wolf, Tal Hassner and Itay Maoz
Face Recognition in Unconstrained Videos with Matched Background Similarity.

IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2011.

Contact:

Lior Wolf, The Blavatnik School of Computer Science, Tel-Aviv University, Israel
Tal Hassner, Computer Science Division, The Open University of Israel
Itay Maoz, The Blavatnik School of Computer Science, Tel-Aviv University, Israel

Change History:

Oct 31, 2012: Fixed a problem with confusion between 0-based and 1-based indices.