MegaFace Million-Scale Face Recognition Dataset

Name: MegaFace Million-Scale Face Recognition Dataset
Creator: 帕依提提
License: No description available

Indexed by 帕依提提 on 2024-03-04

Download link:

https://www.payititi.com/opendatasets/show-645.html

Resource description:

In total, once clustered and optimized, MF2 contains 4,753,320 faces and 672,057 identities. On average this is 7.07 photos per identity, with a minimum of 3 photos per identity and a maximum of 2,469. We expanded the tight crop version by re-downloading the clustered faces and saving a loosely cropped version. The tightly cropped dataset requires 159GB of space, while the loosely cropped version is split into 14 files each requiring 65GB, for a total of 910GB.

In order to gain statistics on age and gender, we ran the WIKI-IMDB models for age and gender detection over the loosely cropped version of the data set. We found that females accounted for 41.1% of subjects while males accounted for 58.8%. The median gender variance within identities was 0. The average age range within identities was 16.1 years, while the median was 12 years. The distributions can be found in the supplementary material.

A trade-off of this algorithm is that we must strike a balance between noise and quantity of data with the parameters. It has been noted by the VGG-Face work that, given the choice between a larger, more impure data set and a smaller hand-cleaned data set, the larger can actually give better performance. A strong reason for opting to remove most faces from the initial unlabeled corpus was detection error. We found that many images were actually non-faces. There were also many identities that did not appear more than once, and these would not be as useful for learning algorithms. By visual inspection of 50 faces randomly selected from those thrown out by the algorithm: 14 were non-faces and 36 were not found more than twice in their respective Flickr accounts. In a complete audit of the clustering algorithm, the reasons for throwing out faces are as follows: 69% faces which were below the …

To create a data set that includes hundreds of thousands of identities, we utilize the massive collection of Creative Commons photographs released by Flickr. This set contains roughly 100M photos and over 550K individual Flickr accounts. Not all photographs in the data set contain faces. Following the MegaFace challenge, we sift through this massive collection and extract faces detected using DLIB's face detector. To optimize hard drive space for millions of faces, we only saved the crop plus 2% of the cropped area for further processing. After collecting and cleaning our final data set, we re-download the final faces at a higher crop ratio (70%).

As the Flickr data is noisy and has sparse identities (with many examples of single photos per identity, while we are targeting multiple photos per identity), we processed the full 100M Flickr set to maximize the number of identities. We therefore employed a distributed queue system, RabbitMQ, to distribute face detection work across 60 compute nodes, with results saved locally. A second collection process aggregates faces to a single machine. In order to optimize for Flickr accounts with a higher possibility of having multiple faces of the same identity, we ignore all accounts with fewer than 30 photos. In total we obtained 40M unlabeled faces across 130,154 distinct Flickr accounts (representing all accounts with more than 30 face photos). The crops of photos take over 1TB of storage. As the photos are taken with different camera settings, photos range in size from low resolution (90x90px) to high resolution (800x800+px). In total the distributed process of collecting and aggregating photos took 15 days.

Labeling million-scale data manually is challenging and, while useful for development of algorithms, there are almost no approaches on how to do it while controlling costs. Companies like MobileEye, Tesla, and Facebook hire thousands of human labelers, costing millions of dollars. Additionally, people make mistakes and get confused with face recognition tasks, resulting in a need to re-test and validate, further adding to costs.
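The collection steps described above (dropping accounts with fewer than 30 photos, fanning face detection out to workers through a queue, then aggregating results in one place) can be sketched roughly as follows. This is only an illustration under stated assumptions: Python's standard `queue` and `threading` modules stand in for RabbitMQ and the 60 compute nodes, and `detect_faces`, `eligible_accounts`, and `collect_faces` are hypothetical names, not part of the original pipeline or of DLIB.

```python
import queue
import threading
from collections import defaultdict

MIN_PHOTOS_PER_ACCOUNT = 30  # from the text: accounts with fewer than 30 photos are ignored
NUM_WORKERS = 4              # assumption for the sketch; the text mentions 60 compute nodes


def detect_faces(photo_id):
    """Hypothetical stand-in for a real face detector.

    Pretends that every photo whose numeric suffix is even contains one face.
    """
    has_face = int(photo_id.split("-")[-1]) % 2 == 0
    return [f"face-of-{photo_id}"] if has_face else []


def eligible_accounts(photos_by_account):
    """Keep only accounts with at least MIN_PHOTOS_PER_ACCOUNT photos."""
    return {
        account: photos
        for account, photos in photos_by_account.items()
        if len(photos) >= MIN_PHOTOS_PER_ACCOUNT
    }


def collect_faces(photos_by_account):
    """Distribute detection work to workers, then aggregate faces per account."""
    tasks = queue.Queue()
    results = queue.Queue()

    # Enqueue all work up front so workers can stop when the queue is empty.
    for account, photos in eligible_accounts(photos_by_account).items():
        for photo_id in photos:
            tasks.put((account, photo_id))

    def worker():
        while True:
            try:
                account, photo_id = tasks.get_nowait()
            except queue.Empty:
                return
            for face in detect_faces(photo_id):
                results.put((account, face))

    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Second stage: aggregate all detected faces in a single place.
    faces_by_account = defaultdict(list)
    while not results.empty():
        account, face = results.get()
        faces_by_account[account].append(face)
    return dict(faces_by_account)
```

For example, calling `collect_faces` with one account holding 40 photos and another holding 5 would return faces only for the first account. In a real deployment the in-process queue would be replaced by a message broker so that workers can run on separate machines.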
We thus look to automated, or semi-automated, methods to improve the purity of collected data. There have been several approaches for automated cleaning of data. O. M. Parkhi et al. used near-duplicate removal to improve data quality. G. Levi et al. used age and gender consistency measures. T. L. Berg et al. and X. Zhang et al. included text from news captions describing celebrity names. H.-W. Ng et al. propose data cleaning as a quadratic programming problem with constraints enforcing the assumptions that noise consists of a relatively small portion of the collected data, that gender is uniform, that identities consist of a majority of the same person, and that a single photo cannot have two of the same person in it. All those methods proved to be important for data cleaning given rough initial labels, e.g., the celebrity name. In our case, rough labels are not given. We do observe that face recognizers perform well at a small scale, and we leverage embeddings to provide a measure of similarity to be used further for labeling.
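To make the idea of using embeddings as a similarity measure concrete, here is a minimal sketch. It assumes cosine similarity and a simple greedy grouping with a hypothetical threshold of 0.9; the text does not specify the similarity function, the threshold, or the clustering procedure actually used, so none of this should be read as the authors' algorithm.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_by_similarity(embeddings, threshold=0.9):
    """Greedy grouping (illustrative only).

    Each face joins the first group whose first member is at least
    `threshold` similar to it; otherwise it starts a new group.
    Returns a list of groups, each a list of indices into `embeddings`.
    """
    groups = []
    for i, emb in enumerate(embeddings):
        for group in groups:
            if cosine_similarity(emb, embeddings[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups
```

With four toy 2-D embeddings such as `[1, 0]`, `[0.99, 0.1]`, `[0, 1]`, and `[0.05, 1]`, the first two and the last two end up in separate groups, which could then serve as candidate identity labels. Real face embeddings have far more dimensions, and the threshold would need tuning against the noise-versus-quantity trade-off.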

Provider:

帕依提提

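Collected summary

Dataset introduction

Background and challenges

Background overview

MegaFace is a million-scale face recognition dataset containing more than 4.7 million faces and 670,000 identities, with about 7 photos per identity on average. The dataset comes in a tightly cropped version and a loosely cropped version, which occupy 159GB and 910GB of storage respectively. It is sourced from Creative Commons photos on Flickr and was cleaned and labeled using automated and semi-automated methods.

The above content was collected and summarized by 遇见数据集.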