FaceCaption-15M

Name: FaceCaption-15M
Creator: maas
Published: 2026-04-28 11:13:05
License: no description provided

ModelScope community: updated 2026-04-28, indexed 2024-08-31

Download link:

https://modelscope.cn/datasets/AI-ModelScope/FaceCaption-15M


Description:

# FaceCaption-15M

You first need to download the data, then apply for access to the original LAION-Face dataset by completing the required agreement (GitHub). Once approved, refer to the information available on HuggingFace to obtain the corresponding image-text pairs.

**[25/06/09] 🤗 The original images are released. [Complete the agreement](https://github.com/ddw2AIGROUP2CQUPT/Large-Scale-Multimodal-Face-Datasets)**

![](https://camo.githubusercontent.com/9f19143c491fa808f3867162e3fb5fb22f7a935a5bc564e1dcadb0cf82420f39/68747470733a2f2f696d672e797574616e676c692e6e65742f696d672f3230323430333138313030363938312e706e67)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/ZG8-YO8kNbzl9JQUFTwu9.png)

**FaceCaption-15M is a large-scale, diverse, and high-quality dataset of facial images accompanied by their natural language descriptions (facial image-to-text). The dataset aims to facilitate research on face-centered tasks. FaceCaption-15M comprises over 15 million pairs of facial images and their corresponding natural language descriptions of facial features, making it the largest facial image caption dataset to date.**

# News and Updates 🔥🔥🔥:

**[25/08/03] 1M face image-text pairs are released! [FaceCaption-1M-image-text-pairs](https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-1M-image-text-pairs)**

**[24/09/16] 🤗 [HumanCaption-10M](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-10M) is released! 👏👏👏**

**[24/09/01] The embeddings of the images in FaceCaption-15M have been released! [OpenFace-CQUPT/Facecaption-15M-Embeddings](https://huggingface.co/datasets/OpenFace-CQUPT/Facecaption-15M-Embeddings)**

**[24/07/17] The checkpoint has been released! [OpenFace-CQUPT/FLIP](https://huggingface.co/OpenFace-CQUPT/FLIP)**

**[24/07/06] Updated the citation.**

**[24/07/05] FaceCaption-15M-V1 has been released!**

FaceCaption-15M-V1 contains only fields such as url, face box, laion_caption, and face_caption.
**Preview 1: HumanCaption-10M [Released!]**: We are about to release the V2 version (HumanCaption), which contains not only the face image description but also a short caption and a detailed caption for the original image. The short caption is limited to 70 words and is intended for diffusion model training and fine-tuning; the detailed caption is limited to 300 words and is intended for multimodal large model training and fine-tuning.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/r-EveWy-R7akMI6QvpXrH.png)

**Preview 2: HumanCaption-HQ**: We extracted about 5M high-resolution image samples from the V2 version, fusing the face detail description with the image description generated by GPT4o. The caption is limited to 500 words and is suited to the supervised fine-tuning stage.

Paper, code, and further datasets are coming soon. Please stay tuned!

# How to use:

```python
# When you use the Datasets library:
from datasets import load_dataset

ds = load_dataset("OpenFace-CQUPT/FaceCaption-15M")

# When you use the pandas library:
import pandas as pd

df = pd.read_parquet("hf://datasets/OpenFace-CQUPT/FaceCaption-15M/FaceCaption-v1.parquet")
```

# Facial language image pretraining (FLIP) model

Based on FaceCaption-15M, we trained a multimodal representation model, [FLIP](https://github.com/ddw2AIGROUP2CQUPT/FaceCaption-15M), similar in concept to CLIP and designed to align facial images with semantics. FLIP contains the following components:

(1) Image Encoder: Composed of a visual transformer, this component processes the image.

(2) Text Encoder: When handling text input alone, this encoder follows the standard BERT module and uses the [CLS] token to summarize the entire sentence. For multimodal input, a cross-attention layer is introduced between the self-attention layer and the feedforward network of the text encoder to fuse visual information (Image-grounded Text Encoder).
To adapt to specific tasks, an [ENC] token is added to the text input and serves as the multimodal representation of the image-text pair.

The complete training code and pre-trained model weights are available at: https://huggingface.co/OpenFace-CQUPT/Facial-language-image-pretraining-model/

# 1. Pipeline of our FaceCaption-15M construction process.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/TCvUu0PlfC26BDbiKM5My.png)

## 1.1 Facial Images Collection

**Image Collection.** We used the LAION-Face dataset, which contains over 50M image-text pairs obtained through web crawling, as our source of raw data. LAION-Face is of considerable scale, and its image distribution closely resembles real-world distributions. Using such a dataset as the raw data source also offers significant cost savings compared with manual data collection. Because of link expiration and network issues, we could access only about 75% of LAION-Face.

**Face Segmentation.** For the original LAION-Face dataset, we segmented the facial regions from the images. First, we selected all images containing faces from LAION-Face using the RetinaFace model, which resulted in approximately 37M images. To obtain a high-quality facial image dataset while avoiding noise interference, we cropped, aligned, and filtered the facial images based on the facial region detection boxes. Specifically, we retained only those facial regions with resolutions exceeding 128 × 128 pixels and confidence scores higher than 0.98, resulting in approximately 23M images. Importantly, to maintain image quality, we did not scale the images to a uniform size, so the collected images vary in resolution.

## 1.2 Facial Attributes Annotation

Attributes play a pivotal role in generating the description text for a facial image and thereby determine the correlation between the image and the text.
We designed 40 appearance attributes for facial features. Given the volume of data to annotate, we chose an automatic annotation method. For efficiency and accuracy, we employed an open-source algorithm for predicting image attributes. To enhance the reliability of the annotations, we retained only the labels predicted by the model with a probability exceeding 0.85. Furthermore, to generate more accurate natural language text, we retained only samples with more than five valid predicted labels. This reduced the dataset size to 15M.

## 1.3 Facial Caption Generation: Raw Text Generation and Rewriting

Because the image-text pairs in the LAION-Face dataset were obtained through subtitle crawling, the text shows only a weak correlation with the accompanying image. Our aim is to generate captions for the facial images. Manual annotation, while accurate, is time-consuming and labor-intensive, making it unsuitable for constructing large-scale datasets. Automatic methods offer efficiency and scalability, but the diversity, complexity, and naturalness of description sentences generated by traditional automatic text generation methods are limited by grammatical templates. With the development of LLMs, text generated by these models has high diversity and naturalness. Here, we propose a text generation strategy that combines grammatical templates with an LLM. Specifically, (1) we first input the attribute annotations from Section 1.2 into the designed grammatical template to generate the raw text, and then (2) we input the raw text into the LLM to generate natural, diverse, and accurate text descriptions.

To ensure that the LLM generates high-quality description text, the quality of the raw text generated by the grammatical template is paramount.
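
The thresholds quoted in Sections 1.1 and 1.2, together with step (1) of the strategy above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the attribute names, the template sentences, and the function names are hypothetical, and reading "exceeding 128 × 128 pixels" as "both sides larger than 128" is one possible interpretation.

```python
import random

MIN_SIDE = 128          # Section 1.1: face region must exceed 128 x 128 pixels
MIN_FACE_CONF = 0.98    # Section 1.1: detection confidence must be higher than 0.98
MIN_ATTR_PROB = 0.85    # Section 1.2: attribute probability must exceed 0.85
MIN_VALID_LABELS = 5    # Section 1.2: keep samples with more than five valid labels


def keep_face(width, height, confidence):
    """Face-region filter. Assumption: both sides must be larger than 128 px."""
    return width > MIN_SIDE and height > MIN_SIDE and confidence > MIN_FACE_CONF


def valid_labels(attr_probs):
    """Return the attribute names whose predicted probability passes the threshold."""
    return [name for name, prob in attr_probs.items() if prob > MIN_ATTR_PROB]


def keep_sample(attr_probs):
    """A sample is kept only if it has more than five valid labels."""
    return len(valid_labels(attr_probs)) > MIN_VALID_LABELS


# Hypothetical templates: each attribute maps to a few candidate short sentences.
TEMPLATES = {
    "young": ["This person looks young.", "The person appears to be young."],
    "smiling": ["The person is smiling.", "There is a smile on the face."],
    "black_hair": ["The person has black hair.", "The hair is black."],
    "wavy_hair": ["The hair is wavy.", "The person has wavy hair."],
    "eyeglasses": ["The person wears eyeglasses.", "Eyeglasses are worn."],
    "arched_eyebrows": ["The eyebrows are arched.", "The person has arched eyebrows."],
}


def raw_text(labels, rng):
    """Build raw text as one short sentence per label that has a template."""
    return " ".join(rng.choice(TEMPLATES[label]) for label in labels if label in TEMPLATES)


# Hypothetical attribute predictions for one face.
example = {
    "young": 0.97, "smiling": 0.91, "black_hair": 0.99,
    "wavy_hair": 0.88, "eyeglasses": 0.93, "arched_eyebrows": 0.90,
    "beard": 0.40,  # below the threshold, so it is dropped
}

if keep_face(200, 180, 0.99) and keep_sample(example):
    print(raw_text(valid_labels(example), random.Random(0)))
```

The raw text produced this way is what step (2) would hand to the LLM for rewriting; that step is left out here because it depends on the model and prompt used.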
We adopted the probabilistic context-free grammar (PCFG) algorithm to generate the raw text as multiple short sentences, each constructed from different attributes. The performance of the LLM itself also affects the quality of the generated captions. We surveyed existing open-source LLMs and selected the Qwen-7B-Chat model.

## 1.4 Statistical Analysis for FaceCaption-15M Dataset

**Comparisons with other popular facial image datasets.** The symbol “#” indicates the number of samples (images or image-text pairs). The abbreviations “mRes”, “Ann”, and “mWords” denote the average resolution of all images, the number of annotations, and the average number of words across all text, respectively. The abbreviation “Align” indicates whether the image contains only faces.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/1dbj5KMGyc80Jo0Nyeekd.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/LeoFyl5yNHhy0xbKQ9BS0.png)

**Image quality score distribution.** (a) BRISQUE evaluation, where lower scores indicate better image quality; (b) CLIPIQA evaluation, where higher scores indicate better image quality.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/KhNW312RKn8lDsuqFSl92.png)

**Text distribution.** (a) Distribution of the five categories of annotations in FaceCaption-15M. (b) The percentage of sentences in the dataset with different word counts. (c) The number of unique 4-grams under the percentage data. (d) Illustrations of image-text pairs from LAION-Face and FaceCaption-15M. FaceCaption* indicates the caption generated by the grammatical template without using the LLM.

**Note:** The comparison with the CelebV-Text dataset is slightly unfair, because CelebV-Text is a video text description dataset; we used the first frame of each video as its representative image.
[CelebV-Text](https://celebv-text.github.io/) is a great dataset; if you need a face video-text dataset, go to the corresponding GitHub repo.

# 2. Limitations and Discussions

During our research, we constructed the FaceCaption-15M dataset. However, cleaning and producing the dataset inevitably introduces some degree of bias or model prejudice. In response, we will keep updating this dataset and strive to minimize the influence of such prejudice.

In addition, in view of the constraints of relevant laws and regulations such as portrait rights and copyright law, although we have obtained 15 million facial images from LAION, we have decided to follow the open-source release mode of the LAION dataset (that is, to publish the original link of the image, the cleaned text description, and the position coordinates of the face in the original image).

If you find that your facial image exists in the dataset and you do not wish your data to be captured, shared, or used for training models, please contact us. We will conduct a brief review of your information and stop distributing your data in the FaceCaption-15M dataset. Note that LAION is the upstream of this dataset, and we cannot request the upstream dataset to stop distributing your photos.

The usage scenarios for large-scale face datasets are limited, and including in-the-wild photos of people appears to hold more research value. Based on FaceCaption-15M, we have further cleaned the HumanCaption-15M dataset of human photos in natural scenes. Its textual descriptions take into account both scene descriptions and facial details. Stay tuned.

Due to the special nature of facial datasets, **this dataset is only allowed to be used for scientific research purposes.**

# 3. Contacts

Email: 2018211556@stu.cqupt.edu.cn or dw_dai@163.com

# 4. Datasets Examples

**The green color is a sample of LAION and the red color is a sample of FaceCaption-15M.**

![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/r9HKtA_ZCRtvIwKIZI4oC.png)

# Additional Information

## Licensing Information

The FaceCaption-15M dataset is released by OpenFaceCQUPT and is intended exclusively for research and educational purposes. It has been generated using publicly available models such as Qwen. Users should be aware that this data may contain inaccuracies, unsafe content, or biases, and should carefully evaluate its accuracy and suitability prior to use. OpenFaceCQUPT and its licensors provide this dataset "AS-IS," without any warranties, express or implied. The views and opinions expressed in the dataset do not necessarily reflect those of OpenFaceCQUPT.

The FaceCaption-15M dataset is licensed under the Creative Commons Attribution 4.0 International License (CC-BY 4.0). The availability of this dataset does not constitute an invitation to use any of the information for any illegal or unlawful purposes, or beyond the scope of research or educational purposes. It is crucial to ensure ethical and responsible use of this dataset to prevent privacy violations and other ethical concerns.

# Citation

```tex
@misc{dai202415mmultimodalfacialimagetext,
  title={15M Multimodal Facial Image-Text Dataset},
  author={Dawei Dai and YuTang Li and YingGe Liu and Mingming Jia and Zhang YuanHui and Guoyin Wang},
  year={2024},
  eprint={2407.08515},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.08515},
}
```


Provider:

maas

Created:

2024-07-19
