
karthik-2905/OpenFake

Source: Hugging Face. Updated 2025-12-09; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/karthik-2905/OpenFake
Resource description:
---
dataset_info:
  features:
  - name: image
    dtype: image
  - name: prompt
    dtype: string
  - name: label
    dtype: string
  - name: model
    dtype: string
  splits:
  - name: train
    num_bytes: 1051540257907.984
    num_examples: 1870684
  - name: test
    num_bytes: 33418712589.0
    num_examples: 59658
  download_size: 1083933904266
  dataset_size: 1084958970496.984
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
license: cc-by-sa-4.0
task_categories:
- image-classification
language:
- en
size_categories:
- 100K<n<1M
---

# Dataset Card for OpenFake

## Dataset Details

### Dataset Description

OpenFake is a dataset designed for evaluating deepfake detection and misinformation mitigation in the context of politically relevant media. It includes high-resolution real and synthetic images generated from prompts with political relevance, including faces of public figures, events (e.g., disasters, protests), and multimodal meme-style images with text overlays. Each image includes structured metadata with its prompt, source model (for synthetic), and human-annotated or pipeline-assigned labels.

* **Authors:** Victor Livernoche; Akshatha Arodi; Andreea Musulan; Zachary Yang; Adam Salvail; Gaétan Marceau Caron; Jean-François Godbout; Reihaneh Rabbany
* **Curated by:** Victor Livernoche; Akshatha Arodi; Jie Zang
* **Funded by:** CIFAR AI Chairs Program; Centre for the Study of Democratic Citizenship (CSDC); IVADO; Canada First Research Excellence Fund; Mila (financial support and computational resources)
* **Language(s) (prompts):** English
* **License:** CC-BY-SA-4.0. Note: subsets produced with proprietary generators are released under non-commercial terms due to “non-compete” clauses; see paper for details.
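
As a quick consistency check on the sizes declared in the metadata header, the sketch below sums the per-split figures and derives the test share and the average stored size per example. The numbers are copied from the card; the `SPLITS` dictionary and the `summarize` helper are illustrative names, not part of the dataset or of any library.

```python
# Per-split figures copied from the card's `dataset_info.splits` metadata.
SPLITS = {
    "train": {"num_bytes": 1051540257907.984, "num_examples": 1870684},
    "test": {"num_bytes": 33418712589.0, "num_examples": 59658},
}

# Declared total from the same metadata block, used for comparison only.
DECLARED_DATASET_SIZE = 1084958970496.984


def summarize(splits):
    """Return totals and simple ratios derived from per-split metadata."""
    total_examples = sum(s["num_examples"] for s in splits.values())
    total_bytes = sum(s["num_bytes"] for s in splits.values())
    return {
        "total_examples": total_examples,
        "total_bytes": total_bytes,
        # Share of all examples that sit in the test split.
        "test_fraction": splits["test"]["num_examples"] / total_examples,
        # Average stored size of one example, in bytes.
        "avg_bytes_per_example": total_bytes / total_examples,
    }


summary = summarize(SPLITS)
print(summary["total_examples"])
print(f"test share: {summary['test_fraction']:.2%}")
print(f"average size: {summary['avg_bytes_per_example'] / 1e6:.2f} MB")
print("bytes match declared size:",
      abs(summary["total_bytes"] - DECLARED_DATASET_SIZE) < 1.0)
```

With the card's figures the two splits sum to 1,930,342 examples, and the per-split byte counts add up to the declared `dataset_size`. That example count lies above the 1M upper bound of the declared `size_categories` bucket, so the tag may not reflect the current release.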

### Dataset Sources

- **Repository:** [https://huggingface.co/datasets/ComplexDataLab/OpenFake](https://huggingface.co/datasets/ComplexDataLab/OpenFake)
- **Arena (crowdsourced adversarial platform):** [https://huggingface.co/spaces/CDL-AMLRT/OpenFakeArena](https://huggingface.co/spaces/CDL-AMLRT/OpenFakeArena)

## Uses

### Direct Use

* Benchmarking binary classifiers for real vs. synthetic image detection
* Evaluating robustness across models and content types (faces, events, memes)
* Training adversarially robust detectors via community submissions (OpenFake Arena)

### Out-of-Scope Use

* Training generative models directly on the dataset without consent
* Any use of personal imagery that violates platform rules or privacy

## Dataset Structure

* `image`: image (real or synthetic)
* `label`: `real` or `fake`
* `model`: the model that generated the synthetic image
* `prompt`: prompt used to generate the synthetic image or caption for a real image

Train/test split is balanced by label and curated for visual and topical diversity. No image overlaps between splits.

**Unused metadata:** `unused_metadata.csv` contains URLs and prompts for images not included in the train/test splits.

## Models Covered

Synthetic images were generated from a diverse set of state-of-the-art generators, including:

- Stable Diffusion **1.5**, **2.1**, **XL**, **3.5**
- Flux **1.0-dev**, **1.1-Pro**, **1.0-Schnell**
- Midjourney **v6**, **v7**
- **DALL·E 3**, **Imagen 3**, **Imagen 4**
- **GPT Image 1**, **Ideogram 3.0**, **Grok-2**, **HiDream-I1**, **Recraft v3**, **Chroma**
- Plus 10 community LoRA/finetuned variants of SD 1.5/XL and Flux-dev

All images are produced at ~1 MP with varied aspect ratios reflecting common social-media formats.

## Dataset Creation

### Curation Rationale

The goal is to fill a gap in deepfake detection datasets by covering high-quality, politically sensitive synthetic imagery and going beyond face-only benchmarks to include events and hybrid image-text memes.
The dataset pairs ~3M politically themed real images (filtered from LAION-400M using Qwen2.5-VL) with ~963k synthetic counterparts, and is complemented by the OpenFake Arena for continual hard negative generation.

### Source Data

**Real images.** Selected from LAION-400M and filtered with Qwen2.5-VL to retain faces and politically salient or newsworthy events. Detailed captions are produced to drive T2I generation and Arena prompts.

**Synthetic images.** Generated using the model list above from a shared prompt bank. Open-source models follow documented generation settings for reproducibility.

#### Who are the source data producers?

* Real: news outlets, political users, and public social-media posts
* Synthetic: produced by researchers and community contributors from prompts; Arena submissions are gated by CLIP for prompt relevance and logged with metadata

#### Personal and Sensitive Information

Source data was filtered to reduce personal or sensitive content; see the paper’s ethics and licensing notes.

## Bias, Risks, and Limitations

There may be overrepresentation of Western political events due to source distribution. Synthetic examples inherit generator biases. Not all labels are exhaustively human-verified. Adversarial use is a risk, mitigated by licensing and the dataset’s focus on detection.

### Recommendations

Use caution when interpreting political narratives in images. Do not use for content generation or facial identity research without additional review.
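
For the benchmarking and cross-model robustness uses listed under Direct Use, one plausible evaluation is the detection rate on synthetic images broken down by the `model` field. The sketch below assumes rows are dictionaries carrying the card's `label` and `model` fields; `recall_by_generator`, the toy rows, and the `guess` key standing in for a detector's output are hypothetical and not part of the dataset or the authors' evaluation code.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Mapping


def recall_by_generator(
    rows: Iterable[Mapping[str, object]],
    predict: Callable[[Mapping[str, object]], str],
) -> Dict[str, float]:
    """Fraction of synthetic images predicted as "fake", grouped by `model`."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        # Only synthetic images have a meaningful generator to group by.
        if row["label"] != "fake":
            continue
        generator = str(row["model"])
        totals[generator] += 1
        if predict(row) == "fake":
            hits[generator] += 1
    return {g: hits[g] / totals[g] for g in totals}


# Toy rows for illustration only; "guess" is a stand-in for a detector output,
# and the generator names are placeholders rather than values from the dataset.
toy_rows = [
    {"label": "fake", "model": "generator-a", "guess": "fake"},
    {"label": "fake", "model": "generator-a", "guess": "real"},
    {"label": "fake", "model": "generator-b", "guess": "fake"},
    {"label": "real", "model": "", "guess": "fake"},
]

per_generator = recall_by_generator(toy_rows, lambda row: row["guess"])
for name, recall in sorted(per_generator.items()):
    print(f"{name}: {recall:.2f}")
```

In practice the rows could come from the Hugging Face `datasets` library, for example `load_dataset("ComplexDataLab/OpenFake", split="test", streaming=True)`, with `predict` wrapping a real detector applied to `row["image"]`. The card does not say what `model` holds for real images, so the sketch skips them rather than guessing.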

## Citation

**BibTeX:**

```bibtex
@misc{livernoche2025openfakeopendatasetplatform,
  title={OpenFake: An Open Dataset and Platform Toward Large-Scale Deepfake Detection},
  author={Victor Livernoche and Akshatha Arodi and Andreea Musulan and Zachary Yang and Adam Salvail and Gaétan Marceau Caron and Jean-François Godbout and Reihaneh Rabbany},
  year={2025},
  eprint={2509.09495},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.09495},
}
```

**APA:**

Livernoche, V., Arodi, A., Musulan, A., Yang, Z., Salvail, A., Marceau Caron, G., Godbout, J.-F., & Rabbany, R. (2025). OpenFake: An open dataset and platform toward large-scale deepfake detection. arXiv. https://arxiv.org/abs/2509.09495

## More Information

For questions, errors, or contributions, visit the GitHub or HF repository.

## Dataset Card Authors

Victor Livernoche

## Dataset Card Contact

victor.livernoche@mail.mcgill.ca
Provider: karthik-2905