ScholarGym
Source: ModelScope community. Updated 2026-03-11; indexed 2026-05-03.
Download link:
https://modelscope.cn/datasets/shenhao23/ScholarGym
Resource overview:
# ScholarGym: Benchmarking Deep Research Workflows on Academic Literature Retrieval
## Dataset Description
ScholarGym is a static evaluation environment for reproducible assessment of deep research workflows
on academic literature retrieval. It provides a unified benchmark with expert-annotated queries over
a static corpus of 570K papers with deterministic retrieval.
- **Paper:** [arXiv:2601.21654](https://arxiv.org/abs/2601.21654)
- **GitHub:** [https://github.com/shenhao-stu/ScholarGym](https://github.com/shenhao-stu/ScholarGym)
### Dataset Components
**1. scholargym_bench (Query Benchmark)**
- 2,536 expert-annotated research queries
- Sourced from PaSa (AutoScholar + RealScholar) and LitSearch datasets
- Each query includes ground-truth relevant papers with arXiv IDs
- Partitioned into:
  - Test-Fast: 200 queries for rapid development iteration
  - Test-Hard: 100 challenging queries requiring cross-area retrieval
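
Because each query comes with ground-truth arXiv IDs, a retrieval run can be scored by set overlap. The following is a minimal sketch, not the benchmark's official scorer: the function name `retrieval_scores` and the example IDs are hypothetical, and it assumes both the system output and the ground truth are available as plain lists of arXiv ID strings.

```python
def retrieval_scores(retrieved_ids, relevant_ids):
    """Return (precision, recall, f1) for one query.

    Both arguments are iterables of arXiv ID strings; duplicates are ignored.
    This is an illustrative metric, not the official ScholarGym evaluation.
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)

    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: three retrieved papers, two relevant, one overlap.
precision, recall, f1 = retrieval_scores(
    ["2401.00001", "2401.00002", "2401.00003"],
    ["2401.00002", "2401.00004"],
)
```

Averaging these per-query values over a subset such as Test-Fast gives a single number for comparing runs.
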
**2. scholargym_paper_db (Paper Corpus)**
- 570K academic papers spanning computer science, physics, and mathematics
- Enriched with arXiv metadata (title, abstract, publication date, authors)
- Deduplicated by arXiv identifier
- Supports deterministic retrieval for reproducible evaluation
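
To illustrate what deduplication by arXiv identifier and deterministic retrieval can look like in practice, here is a small sketch. It is not the ScholarGym implementation: the record fields (`arxiv_id`, `title`, `abstract`), the version-suffix stripping, and the term-count scoring are all assumptions chosen for illustration. The point it shows is that breaking score ties on a stable key makes the ranking identical on every run.

```python
import re

# Assumption: IDs may carry a version suffix such as "v2" that should be ignored.
_VERSION_SUFFIX = re.compile(r"v\d+$")


def normalize_arxiv_id(arxiv_id):
    """Strip whitespace and any trailing version suffix from an arXiv ID."""
    return _VERSION_SUFFIX.sub("", arxiv_id.strip())


def deduplicate(papers):
    """Keep the first record seen for each normalized arXiv ID."""
    unique = {}
    for paper in papers:
        unique.setdefault(normalize_arxiv_id(paper["arxiv_id"]), paper)
    return list(unique.values())


def rank(papers, query_terms, top_k=3):
    """Rank papers by a toy term-count score, breaking ties by arXiv ID."""
    def score(paper):
        text = (paper["title"] + " " + paper["abstract"]).lower()
        return sum(text.count(term.lower()) for term in query_terms)

    # Sorting on (-score, id) yields the same order regardless of input order.
    ordered = sorted(
        papers,
        key=lambda p: (-score(p), normalize_arxiv_id(p["arxiv_id"])),
    )
    return [normalize_arxiv_id(p["arxiv_id"]) for p in ordered[:top_k]]


# Hypothetical records, including two versions of the same paper.
papers = [
    {"arxiv_id": "2401.00002v1", "title": "Dense retrieval", "abstract": "retrieval of papers"},
    {"arxiv_id": "2401.00001", "title": "Sparse retrieval", "abstract": "retrieval of papers"},
    {"arxiv_id": "2401.00002v2", "title": "Dense retrieval", "abstract": "retrieval of papers, revised"},
]
unique_papers = deduplicate(papers)
top_ids = rank(unique_papers, ["retrieval"])
```
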
## Usage
```python
from modelscope.msdatasets import MsDataset

# Load ScholarGym from the ModelScope hub by its dataset ID.
ds = MsDataset.load('shenhao23/ScholarGym')
```
## Citation
```bibtex
@article{shen2026scholargym,
title={ScholarGym: Benchmarking Large Language Model Capabilities in the Information-Gathering Stage of Deep Research},
author={Shen, Hao and Yang, Hang and Gu, Zhouhong},
journal={arXiv preprint arXiv:2601.21654},
year={2026}
}
```
## License
This dataset is released under the Apache License 2.0.
## Acknowledgments
We thank the authors of [PaSa](https://github.com/bytedance/pasa) and [LitSearch](https://github.com/princeton-nlp/LitSearch) for providing the base datasets.
Provider:
maas
Created:
2026-02-14



