DataCurBench
收藏DataCurBench 数据集概述
📖 数据集简介
- 名称: DataCurBench
- 语言: 英文 (
en)、中文 (zh) - 许可证: Apache-2.0
- 标签:
benchmark、data-curation - 用途: 评估大语言模型在数据过滤(
data_filtering)和数据清洗(data_cleaning)任务中的自主能力。
📂 数据集结构
- 配置:
data_filtering: 数据过滤任务- 英文切分:
data_filtering/en.json - 中文切分:
data_filtering/zh.json
- 英文切分:
data_cleaning: 数据清洗任务- 英文切分:
data_cleaning/en.json - 中文切分:
data_cleaning/zh.json
- 英文切分:
- 格式: JSON Lines (
.json)
🚀 安装与加载
bash pip install datasets
python from datasets import load_dataset
加载英文数据过滤任务
ds_filter_en = load_dataset( "anonymousaiauthor/DataCurBench", name="data_filtering", split="en" )
加载英文数据清洗任务
ds_clean_zh = load_dataset( "anonymousaiauthor/DataCurBench", name="data_cleaning", split="en" )
🔍 数据示例
数据过滤任务 (data_filtering/en.json)
json [ { "id": "en-filter-186", "text": "The Donaldson Adoption Institute...", "decision": "Retain" }, { "id": "en-filter-15", "text": "Mount Aloysius vs Penn State Altoona...", "decision": "Reject" } ]
数据清洗任务 (data_cleaning/en.json)
json [ { "idx": "en-clean-1752", "raw_text": "The novel, Metropolis by The^*&%#a R=exa^n%ds...", "cleaned_text_human": "The novel, Metropolis by Alexander...", "cleaned_text_reference": "The novel, Metropolis by Alexander...", "meta": { "topic": "Literature & Arts", "source": "Encyclopedia", "function": "RemoveNoise", "subtopic": "Book Reviews", "difficulty": 3 } } ]
📝 引用
plaintext Anonymous_AI_Author et al. (2025). DataCurBench: Are LLMs Ready to Self‑Curate Pretraining Data?.
⚠️ 注意事项
- 偏见与安全性: 数据集包含真实网络数据,可能存在偏见或敏感内容。
- 许可证: 数据集基于以下公开数据集构建,均采用 Apache 2.0 许可证:




