vankey/RealText-V2

Name: vankey/RealText-V2
Creator: vankey
Published: 2026-04-21 04:58:26
License: No description available

Hugging Face | Updated 2026-04-21 | Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/vankey/RealText-V2

Resource description:

---
license: cc-by-nc-4.0
task_categories:
- image-segmentation
- text-classification
- visual-question-answering
language:
- en
- zh
- ar
- th
- ms
- id
tags:
- document-forgery-analysis
- forgery-detection
- multilingual
- document-analysis
- tampering-detection
size_categories:
- 10K<n<100K
---

# RealText-V2: A Large-Scale Multilingual Document Forgery Analysis Benchmark

![RealText-V2 Sample](doc_sample.png)

## 💾 Dataset Description

**RealText-V2** is a large-scale multilingual document benchmark dataset purpose-built for multilingual text image forgery analysis, pioneering in both scale and annotation depth.

### Key Features

- **20K+ images**: A large-scale benchmark, surpassing existing document forgery analysis datasets by orders of magnitude
- **6 languages**: English, Chinese, Arabic, Thai, Malay, and Indonesian, spanning Latin, logographic, Arabic, and Thai script systems, each presenting unique forgery analysis challenges
- **6 domains**: Finance, education, healthcare, live streaming, e-commerce, and natural scenes
- **Multi-granularity forgery**: Character-level, word-level, and semantic-level tampering
- **Multi-source samples**: Real-world and AIGC-synthesized forgery samples covering diverse generation pipelines
- **Rich multi-task annotations**: Pixel-level localization masks, tampering type labels, and expert-level natural language explanations

### Competition Timeline

**ACM MM 2026 MGC: GenText-Forensics: Challenge on Explainable Forensics and Adversarial Generation for Text-Centric Images**

https://www.codabench.org/competitions/15805/

| Phase | Date |
| --- | --- |
| Competition Launch | April 17, 2026 |
| Training Data Release | April 20, 2026 |
| Evaluation Submission Opens | May 22, 2026 |
| Leaderboard Freeze | May 31, 2026 |
| Paper Submission Deadline | June 20, 2026 |
| ACM MM 2026, Rio de Janeiro | November 10–14, 2026 |

## 📊 Dataset Structure

```
RealText-V2/
├── train/
│   ├── image/          # Document images (.jpg for forged, .png for pristine)
│   │   ├── part000/    # Sharded at 1000 files per subdirectory
│   │   ├── part001/
│   │   └── ...
│   ├── mask/           # Binary tampering masks (forged only)
│   │   ├── part000/
│   │   └── ...
│   └── report/         # Structured forgery analysis reports (.md)
│       ├── part000/
│       ├── part001/
│       └── ...
├── doc_sample.png
└── metadata.parquet    # Index file with sample metadata
```

> **Note:** The test split is withheld for the ongoing ACM MM 2026 competition and will be released after the competition concludes.

### Splits

| Split | Total | Black (Forged) | White (Pristine) |
|-------|-------|----------------|-------------------|
| train | 13,500 | 7,500 | 6,000 |

### Language Distribution (Train)

| Language | Code | Black (Forged) | White (Pristine) |
|----------|------|----------------|-------------------|
| English | en | 2,000 | 1,000 |
| Chinese | zh | 2,000 | 1,000 |
| Thai | th | 1,000 | 1,000 |
| Malay | ms | 1,000 | 1,000 |
| Indonesian | id | 1,000 | 1,000 |
| Arabic | ar | 500 | 1,000 |

## 📋 Data Fields

| Field | Description |
|-------|-------------|
| `sample_id` | Unique identifier (e.g., `GenText_Forensic_00000000`) |
| `language` | Full language name |
| `language_code` | ISO 639-1 code |
| `type` | `black` (forged) or `white` (pristine) |
| `image_file` | Image filename |
| `mask_file` | Mask filename (empty for white samples) |
| `has_mask` | Whether tampering mask exists |
| `report_file` | Report filename |
| `report_text` | Full report content |

## 📝 Report Format

Each report is a structured markdown document:

```markdown
# FORGERY ANALYSIS REPORT
**[Conclusion]:** FORGED / PRISTINE
**[RISK_SCORE]:** 0-100
### ANOMALY_001: [type] ([location])
[GROUNDING]: [x1, y1, x2, y2]
[REASON]: [explanation text]
## SUMMARY
[summary text]
```

## ⚖️ License

This dataset is released under CC-BY-NC-4.0 for research purposes only.

## 🙏 Acknowledgments

RealText-V2 is created for the ACM MM 2026 competition on document forgery analysis.
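
The report template in the Report Format section can be turned into structured fields with a small parser. The following is a minimal sketch, not an official loader. It assumes each field sits on its own line, that the `[GROUNDING]` value is a bracketed list of four numbers, and that one report may hold several `ANOMALY_xxx` blocks. The names `Report`, `Anomaly`, and `parse_report`, as well as the sample text, are hypothetical and used only for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Anomaly:
    anomaly_id: str                       # e.g. "ANOMALY_001"
    heading: str                          # raw text after the id, e.g. "type (location)"
    box: Optional[Tuple[float, ...]] = None  # (x1, y1, x2, y2) if four numbers were found
    reason: str = ""


@dataclass
class Report:
    conclusion: str = ""
    risk_score: Optional[int] = None
    anomalies: List[Anomaly] = field(default_factory=list)
    summary: str = ""


# Patterns follow the template's literal markers; the exact layout is an assumption.
CONCLUSION_RE = re.compile(r"\*\*\[Conclusion\]:\*\*\s*(\w+)", re.IGNORECASE)
RISK_RE = re.compile(r"\*\*\[RISK_SCORE\]:\*\*\s*(\d+)")
ANOMALY_RE = re.compile(r"^###\s*(ANOMALY_\d+):\s*(.*)$")
GROUNDING_RE = re.compile(r"^\[GROUNDING\]:\s*\[([^\]]*)\]")
REASON_RE = re.compile(r"^\[REASON\]:\s*(.*)$")


def parse_report(text: str) -> Report:
    report = Report()
    current = None          # anomaly block currently being filled
    in_summary = False
    summary_lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = ANOMALY_RE.match(line)
        if m:
            current = Anomaly(m.group(1), m.group(2).strip())
            report.anomalies.append(current)
            in_summary = False
            continue
        if line.upper().startswith("## SUMMARY"):
            in_summary, current = True, None
            continue
        if in_summary:
            summary_lines.append(line)
            continue
        m = CONCLUSION_RE.search(line)
        if m:
            report.conclusion = m.group(1).upper()
            continue
        m = RISK_RE.search(line)
        if m:
            report.risk_score = int(m.group(1))
            continue
        if current is not None:
            m = GROUNDING_RE.match(line)
            if m:
                nums = re.findall(r"-?\d+(?:\.\d+)?", m.group(1))
                if len(nums) == 4:  # ignore malformed boxes instead of guessing
                    current.box = tuple(float(n) for n in nums)
                continue
            m = REASON_RE.match(line)
            if m:
                current.reason = m.group(1).strip()
    report.summary = " ".join(summary_lines)
    return report


# Made-up report text that follows the template; not taken from the dataset.
sample = """# FORGERY ANALYSIS REPORT
**[Conclusion]:** FORGED
**[RISK_SCORE]:** 87
### ANOMALY_001: word-level (top right)
[GROUNDING]: [120, 48, 310, 92]
[REASON]: Font weight differs from the neighbouring text.
## SUMMARY
One tampered region was found.
"""

report = parse_report(sample)
print(report.conclusion, report.risk_score, report.anomalies[0].box)
```

The same function could be applied to the `report_text` column of `metadata.parquet`. Real reports may deviate from the template, so it is worth checking how many rows come back with an empty `conclusion` before relying on the parsed values.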

Provided by:

vankey