Pornographic Image Recognition Dataset
National Basic Science Data Center (国家基础学科公共科学数据中心), indexed 2026-01-17
Download link:
https://nbsdc.cn/general/dataDetail?id=6967bdac195d26230e9b11a8&type=1
Resource description:
This dataset targets internet content governance and automatic pornographic image recognition scenarios in 2020–2021. It is a binary image classification dataset for adult content detection (pornographic / non-pornographic) and can support research tasks such as content moderation, sensitive content filtering, deep learning model training, and robustness evaluation. The samples were collected mainly in 2020 from public sources, including social media, film content (screenshots and stills), and public-use image platforms. During aggregation, the collection was made to cover varied shooting conditions, resolutions, and visual styles as far as possible, and a certain proportion of easily confused "hard example" samples was included to reflect the false positive and false negative challenges of real moderation environments. During processing, the original images were first checked for readability and integrity; damaged, undecodable, and clearly abnormal samples were removed, followed by deduplication and label consistency verification. The images were then standardized, and moderate quality enhancement was applied to low-quality samples to improve recognizability and data consistency. For model training, mild data augmentation that does not change image semantics is recommended to improve the model's adaptability to variations in scale, lighting, and background.
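The cleaning steps described above (integrity check, deduplication, label consistency verification) could be sketched roughly as follows. This is an illustrative sketch only, not the dataset authors' actual pipeline: the function names `sniff_format`, `read_bytes_or_none`, and `clean` are hypothetical, the integrity check only looks at JPEG/PNG magic bytes rather than fully decoding the image, and duplicates are detected by exact SHA-256 match, so near-duplicates (resized or re-encoded copies) would not be caught.

```python
import hashlib
from pathlib import Path

# Magic-byte prefixes for two common formats (illustrative, not exhaustive).
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
}


def sniff_format(data):
    """Return a format name if the bytes start with a known signature, else None."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def read_bytes_or_none(path):
    """Read a file from disk; return None if it cannot be read."""
    try:
        return Path(path).read_bytes()
    except OSError:
        return None


def clean(records):
    """Filter, deduplicate, and label-check a collection of samples.

    records: iterable of (name, data, label), where data is the raw file
    content as bytes (or None if unreadable) and label is 0 or 1.

    Returns (kept, rejected, conflicts):
      kept      - list of (name, label), first occurrence of each unique content
      rejected  - names whose data is missing, empty, or not a recognized image
      conflicts - set of content hashes that appear with more than one label
    """
    first_seen = {}  # content hash -> (name, label) of the first occurrence
    labels = {}      # content hash -> set of labels seen for that content
    rejected = []
    for name, data, label in records:
        if not data or sniff_format(data) is None:
            rejected.append(name)
            continue
        digest = hashlib.sha256(data).hexdigest()
        labels.setdefault(digest, set()).add(label)
        first_seen.setdefault(digest, (name, label))
    conflicts = {d for d, seen in labels.items() if len(seen) > 1}
    # Content with conflicting labels is held back for manual review.
    kept = [item for d, item in first_seen.items() if d not in conflicts]
    return kept, rejected, conflicts
```

In practice the records would be built from a labeled file listing, for example `[(p, read_bytes_or_none(p), y) for p, y in listing]`, and a real pipeline would likely also decode each image with an imaging library to catch truncated files that still carry a valid header.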
Provider:
Beijing Institute of Technology
Collected summary
Dataset introduction

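Background and challenges
Background overview
This is a binary image classification dataset for adult content detection (pornographic / non-pornographic), aimed at internet content governance and automatic recognition scenarios in 2020-2021. It covers multiple public sources such as social media and film content, and includes easily confused samples to reflect real moderation challenges. The data has been standardized and quality-enhanced, and is suitable for content moderation, sensitive content filtering, and deep learning model training and evaluation.
The above content was collected and summarized by 遇见数据集.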



