CodeIO-PyEdu-Reasoning-Raw
ModelScope community | Updated 2025-12-05 | Indexed 2025-02-22
Download link:
https://modelscope.cn/datasets/hkust-nlp/CodeIO-PyEdu-Reasoning-Raw
Resource description:
# CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
<p align="left">
📑 <a href="https://huggingface.co/papers/2502.07316" target="_blank">Paper</a>    |    🌐 <a href="https://codei-o.github.io/" target="_blank">Project Page</a>    |    💾 <a href="https://huggingface.co/collections/hkust-nlp/codei-o-67a978e28fd926b56a4f55a2" target="_blank">Released Resources</a>    |    📦 <a href="https://github.com/hkust-nlp/CodeIO" target="_blank">Repo</a>
</p>
We release the raw data for our processed PythonEdu-Reasoning dataset.
Each line of `0_368500_filtered_v2_ds25.sced.jsonl` is a JSON object in the following format:
```
{
  "problem_description": <the problem description of the function>,
  "io_requirements": <the input/output requirements and constraints>,
  "refcode": <the reference code, including imported packages (optional), auxiliary functions (optional) and main entrypoint function>,
  "funcname": <the function name for the entrypoint function>,
  "ios": [
    {
      "input": <the input arguments>,
      "output": <the returned value>
    },
    ...
  ],
  "source": <the source of the raw code files>,
  "category": <the reasoning type we assign to this sample>,
  "meta": <meta information about this sample>
}
```
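As a minimal sketch of how a file in this format could be read, the snippet below parses one JSON object per line and reports which of the documented keys are absent from a record. The helper names `iter_records` and `missing_keys` are hypothetical and are not part of the released resources.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Set

# Top-level keys listed in the record format above.
EXPECTED_KEYS = {
    "problem_description", "io_requirements", "refcode", "funcname",
    "ios", "source", "category", "meta",
}


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-blank line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def missing_keys(record: Dict[str, Any]) -> Set[str]:
    """Return the documented keys that are absent from a record."""
    return EXPECTED_KEYS - set(record)
```

An open file handle can be passed directly, for example `with open("0_368500_filtered_v2_ds25.sced.jsonl", encoding="utf-8") as f: records = iter_records(f)`, so the file is streamed line by line rather than loaded into memory at once.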
Some `ios` lists are empty. When the code was executed for these samples, the inputs or outputs exceeded our size constraints, so they were neither stored nor used later.
*Note: Because the LLM-based transformations are imperfect, some problem descriptions do not contain enough information to describe the code. We leave improving the data and releasing an updated version as future work.*
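A consumer of the raw data will probably want to skip records whose `ios` list is empty and, for the rest, may want to replay the stored pairs against `refcode`. The sketch below shows one way to do both. It assumes that each `"input"` is a dictionary of keyword arguments for the entrypoint function; the format description above does not state this, so check it against the actual data. The helper names and the toy record are hypothetical.

```python
from typing import Any, Dict, List


def has_io_pairs(record: Dict[str, Any]) -> bool:
    """True when the record carries at least one stored input/output pair."""
    return bool(record.get("ios"))


def replay_ios(record: Dict[str, Any]) -> List[bool]:
    """Re-run the entrypoint on each stored input and compare with the stored output.

    ASSUMPTION: every "input" is a dict of keyword arguments.
    WARNING: exec() runs arbitrary code; only do this inside a sandbox.
    """
    namespace: Dict[str, Any] = {}
    exec(record["refcode"], namespace)          # defines imports, helpers, entrypoint
    entrypoint = namespace[record["funcname"]]  # look up the entrypoint by name
    return [entrypoint(**pair["input"]) == pair["output"] for pair in record["ios"]]


# Hypothetical record invented for illustration; not taken from the dataset.
toy = {
    "refcode": "def main_solution(a, b):\n    return a + b\n",
    "funcname": "main_solution",
    "ios": [{"input": {"a": 1, "b": 2}, "output": 3}],
}
empty = dict(toy, ios=[])  # same record with no stored pairs
```

A `False` entry in the result does not necessarily indicate bad data: reference code that relies on randomness or on the execution environment may legitimately return a different value on replay.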
## Citation
If you find these resources helpful, please cite:
```
@article{li2025codeio,
  title={CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction},
  author={Li, Junlong and Guo, Daya and Yang, Dejian and Xu, Runxin and Wu, Yu and He, Junxian},
  journal={arXiv preprint arXiv:2502.07316},
  year={2025}
}
```
Provider:
maas
Created:
2025-02-17



