DatasetResearch
Source: ModelScope community (updated 2026-01-06, indexed 2025-08-30)
Download: https://modelscope.cn/datasets/GAIR/DatasetResearch
# Dataset Research
[arXiv:2508.06960](https://arxiv.org/abs/2508.06960)
[GitHub](https://github.com/GAIR-NLP/DatasetResearch)
[License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)
This dataset is part of the **DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery** research project, presented in the paper [DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery](https://huggingface.co/papers/2508.06960).

**Abstract:**
The rapid advancement of large language models has fundamentally shifted the bottleneck in AI development from computational power to data availability, with countless valuable datasets remaining hidden across specialized repositories, research appendices, and domain platforms. As reasoning capabilities and deep research methodologies continue to evolve, a critical question emerges: can AI agents transcend conventional search to systematically discover any dataset that meets specific user requirements, enabling truly autonomous demand-driven data curation? We introduce DatasetResearch, the first comprehensive benchmark evaluating AI agents' ability to discover and synthesize datasets from 208 real-world demands across knowledge-intensive and reasoning-intensive tasks. Our tri-dimensional evaluation framework reveals a stark reality: even advanced deep research systems score only 22% on our challenging DatasetResearch-pro subset, exposing the vast gap between current capabilities and perfect dataset discovery. Our analysis uncovers a fundamental dichotomy: search agents excel at knowledge tasks through retrieval breadth, while synthesis agents dominate reasoning challenges via structured generation, yet both catastrophically fail on "corner cases" outside existing distributions. These findings establish the first rigorous baseline for dataset discovery agents and illuminate the path toward AI systems capable of finding any dataset in the digital universe. Our benchmark and comprehensive analysis provide the foundation for the next generation of self-improving AI systems and are publicly available in the project's [GitHub repository](https://github.com/GAIR-NLP/DatasetResearch).
## Dataset Overview
This collection serves as a comprehensive benchmark for evaluating agent systems designed for demand-driven dataset discovery. It includes:
- **Test Metadata**: Structured metadata for benchmark evaluation tasks (200+ tasks)
- **Test Set Collections**: Two curated collections of diverse NLP datasets (200+ datasets)
- **Multi-domain Coverage**: Datasets spanning knowledge, reasoning, and agent evaluation tasks
- **Multi-lingual Support**: Datasets in English, Chinese, and other languages
## Dataset Structure
```
test_dataset/
├── README.md # This file
├── test_metadata.json # Core metadata for evaluation tasks
├── test_set_generated_1/ # Hugging Face collection (91 datasets)
└── test_set_generated_2/ # Papers with Code collection (117 datasets)
```
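As a quick sanity check on a local copy, the layout above can be compared against the stated counts (91 + 117 = 208, matching the 208 demands mentioned in the abstract). The following is a minimal sketch, not part of the DatasetResearch tooling: the helper name `summarize_layout` is hypothetical, and it assumes that each dataset appears as one direct child (file or folder) of its collection directory, which the README does not confirm.

```python
from pathlib import Path

# Directory names and expected counts are taken from the tree above.
EXPECTED_COLLECTIONS = {
    "test_set_generated_1": 91,
    "test_set_generated_2": 117,
}


def summarize_layout(root):
    """Report whether test_metadata.json exists and how many entries each collection holds."""
    root = Path(root)
    summary = {"has_metadata": (root / "test_metadata.json").is_file()}
    for name in EXPECTED_COLLECTIONS:
        folder = root / name
        # Assumption: one direct child per dataset; hidden files are ignored.
        if folder.is_dir():
            entries = [p for p in folder.iterdir() if not p.name.startswith(".")]
        else:
            entries = []
        summary[name] = len(entries)
    return summary


if __name__ == "__main__":
    # "test_dataset" is the root name shown in the tree; adjust to your local path.
    found = summarize_layout("test_dataset")
    for name, expected in EXPECTED_COLLECTIONS.items():
        print(f"{name}: found {found[name]}, expected {expected}")
    print("test_metadata.json present:", found["has_metadata"])
```

If the counts differ, the one-child-per-dataset assumption may simply not hold for the actual files, so a mismatch is a prompt to inspect the directories rather than proof of a broken download.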
## Usage
To load the dataset, you can use the Hugging Face `datasets` library:
```python
from datasets import load_dataset
# Load the test metadata for DatasetResearch
dataset = load_dataset("GAIR/DatasetResearch", split="test")
# Explore the dataset
print(dataset)
print(dataset[0])
```
For detailed usage instructions, examples, and integration with the DatasetResearch framework, please visit our [GitHub repository](https://github.com/GAIR-NLP/DatasetResearch).
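When the files are already on disk, `test_metadata.json` can also be inspected with the standard library alone. The sketch below is illustrative only: the README does not document the file's schema, so the hypothetical helper `describe_metadata` assumes the top level is either a list of task records or a dict keyed by task id, and it reports whatever field names it finds instead of expecting specific ones.

```python
import json
from pathlib import Path


def describe_metadata(path):
    """Summarize a JSON metadata file without assuming specific field names."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Assumption: a list of task records, or a dict mapping task ids to records.
    if isinstance(data, list):
        records = data
    elif isinstance(data, dict):
        records = list(data.values())
    else:
        records = []
    first = records[0] if records else None
    fields = sorted(first.keys()) if isinstance(first, dict) else []
    return {
        "container": type(data).__name__,
        "num_records": len(records),
        "fields": fields,
    }


if __name__ == "__main__":
    # Path follows the directory tree in this README; adjust to your local copy.
    path = Path("test_dataset/test_metadata.json")
    if path.is_file():
        print(describe_metadata(path))
    else:
        print(f"{path} not found; download the dataset files first")
```

Printing the discovered fields first makes it easier to decide which ones matter before wiring the metadata into an evaluation script.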
## Citation
If you use this dataset in your research, please cite our paper:
```bibtex
@misc{li2025datasetresearchbenchmarkingagentsystems,
title={DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery},
author={Keyu Li and Mohan Jiang and Dayuan Fu and Yunze Wu and Xiangkun Hu and Dequan Wang and Pengfei Liu},
year={2025},
eprint={2508.06960},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2508.06960},
}
```
Provider: maas
Created: 2025-08-12



