AutoSDT-5K
Source: ModelScope community. Updated 2025-07-24; indexed 2025-07-05.
Download link: https://modelscope.cn/datasets/osunlp/AutoSDT-5K
# AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists
AutoSDT-5K is an **automatically constructed** dataset of 5,404 coding tasks for data-driven discovery that covers four scientific disciplines and 756 unique Python packages. Expert feedback on a subset of 256 tasks shows the quality of AutoSDT-5K: 93% of the collected tasks are ecologically valid, and 92.2% of the synthesized programs are functionally correct. To the best of our knowledge, AutoSDT-5K is the only automatically collected and the largest open dataset for data-driven scientific discovery so far.
Project Page: https://osu-nlp-group.github.io/AutoSDT/
Paper: https://arxiv.org/abs/2506.08140
Code: https://github.com/OSU-NLP-Group/AutoSDT.git
## License
AutoSDT creates tasks based on open-source code and data, and we respect the creators’ ownership and intellectual property. We have made our best effort to ensure that all 1325 repositories composing the final tasks in AutoSDT-5K have licenses that permit academic use; Appendix G of the paper gives more details. We welcome requests from the original authors to modify or remove tasks related to their repositories. The following table lists the licenses and the number of repositories under each:
| **License** | **Repositories** |
|------------------|------------------|
| MIT | 449 |
| GNU | 247 |
| Apache | 145 |
| BSD | 84 |
| CC | 57 |
| Boost | 4 |
| Public Domain | 3 |
| ISC | 1 |
| Eclipse | 1 |
| PolyForm | 1 |
| Mulan | 1 |
| Other (Custom) | 15 |
We manually checked the 15 repositories with custom licenses and confirmed that they all allow academic and non-commercial use:
| **Repositories with Custom Licenses** |
|--------------------------------------------|
| GabrieleLozupone/AXIAL |
| fhalab/MLDE |
| snacktavish/TreeToReads |
| usnistgov/SDNist |
| ruppinlab/CSI-Microbes-identification |
| fenchri/edge-oriented-graph |
| SNU-LIST/QSMnet |
| Ramprasad-Group/polygnn |
| gdalessi/OpenMORe |
| svalkiers/clusTCR |
| AI-sandbox/SALAI-Net |
| pixelite1201/agora_evaluation |
| jsunn-y/PolymerGasMembraneML |
| spectrochempy/spectrochempy |
| usnistgov/atomgpt |
There are also 317 repositories without any license information. We assume that these repositories are permissive for academic purposes.
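
As a quick sanity check on the figures in this section, the short Python sketch below sums the per-license counts from the table, adds the 317 repositories without license information, and compares the result with the stated total of 1325. The variable and function names are illustrative only and are not part of the AutoSDT codebase.

```python
# License counts copied from the table in this section
# (number of repositories per license family).
license_counts = {
    "MIT": 449,
    "GNU": 247,
    "Apache": 145,
    "BSD": 84,
    "CC": 57,
    "Boost": 4,
    "Public Domain": 3,
    "ISC": 1,
    "Eclipse": 1,
    "PolyForm": 1,
    "Mulan": 1,
    "Other (Custom)": 15,
}
unlicensed = 317       # repositories without any license information
total_reported = 1325  # total number of repositories stated in this section

licensed = sum(license_counts.values())
total = licensed + unlicensed


def share(count, whole=total):
    """Return `count` as a percentage of `whole`, rounded to one decimal."""
    return round(100 * count / whole, 1)


print(f"licensed: {licensed}, unlicensed: {unlicensed}, total: {total}")
print(f"matches reported total: {total == total_reported}")
print(f"share without license info: {share(unlicensed)}%")
```

The listed counts add up to the reported total, so the table and the unlicensed group together account for every repository. Roughly a quarter of the repositories fall under the assumed-permissive group.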
## Citation
Please cite our paper if you use our data, model, or code.
```bibtex
@misc{li2025autosdtscalingdatadrivendiscovery,
title={AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists},
author={Yifei Li and Hanane Nour Moussa and Ziru Chen and Shijie Chen and Botao Yu and Mingyi Xue and Benjamin Burns and Tzu-Yao Chiu and Vishal Dey and Zitong Lu and Chen Wei and Qianheng Zhang and Tianyu Zhang and Song Gao and Xuhui Huang and Xia Ning and Nesreen K. Ahmed and Ali Payani and Huan Sun},
year={2025},
eprint={2506.08140},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.08140},
}
```
Provider: maas
Created: 2025-07-04



