GSM8K Chinese Dataset
Source: ModelScope community. Updated 2026-05-19; indexed 2025-02-15.
Download link:
https://modelscope.cn/datasets/testUser/GSM8K_zh
Resource description:
# Dataset
`GSM8K_zh` is a dataset for mathematical reasoning in Chinese. Its question-answer pairs were translated from GSM8K (https://github.com/openai/grade-school-math/tree/master) by `GPT-3.5-Turbo` with few-shot prompting.
The dataset consists of 7473 training samples and 1319 testing samples. The former is for **supervised fine-tuning**, while the latter is for **evaluation**.
For training samples, `question_zh` and `answer_zh` are the question and answer keys, respectively.
For testing samples, only the translated questions are provided (`question_zh`).
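
The key layout described above can be handled with a few lines of Python. The following is a minimal sketch, not the dataset's official loader: it assumes the records are available as JSON Lines text (the on-disk file format is not stated here), and the helper names `parse_records`, `to_sft_pairs`, and `to_eval_prompts`, as well as the two sample records, are hypothetical and made up for illustration.

```python
import json
from typing import Iterable


def parse_records(lines: Iterable[str]) -> list[dict]:
    """Parse JSON Lines text into dicts, skipping blank lines.

    Assumption: one JSON object per line; the real file format may differ.
    """
    return [json.loads(line) for line in lines if line.strip()]


def to_sft_pairs(records: list[dict]) -> list[tuple[str, str]]:
    """Return (question, answer) pairs for records that carry both keys.

    Per the dataset description, only training samples have `answer_zh`.
    """
    return [
        (r["question_zh"], r["answer_zh"])
        for r in records
        if "question_zh" in r and "answer_zh" in r
    ]


def to_eval_prompts(records: list[dict]) -> list[str]:
    """Return the translated questions, the only field test samples provide."""
    return [r["question_zh"] for r in records if "question_zh" in r]


# Hypothetical records written for this example; they are not taken from the dataset.
sample_train = [
    '{"question_zh": "小明有3个苹果,又买了2个,现在一共有几个?", '
    '"answer_zh": "3 + 2 = 5,所以一共有5个苹果。"}'
]
sample_test = [
    '{"question_zh": "一本书有120页,每天读30页,需要几天读完?"}'
]

train_pairs = to_sft_pairs(parse_records(sample_train))
test_prompts = to_eval_prompts(parse_records(sample_test))
```

Filtering on the presence of `answer_zh` keeps the two splits from being mixed up: a test record passed to `to_sft_pairs` is silently skipped rather than raising a `KeyError`.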
# Citation
If you find the `GSM8K_zh` dataset useful for your projects/papers, please cite the following paper.
```bibtex
@article{yu2023metamath,
title={MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models},
author={Yu, Longhui and Jiang, Weisen and Shi, Han and Yu, Jincheng and Liu, Zhengying and Zhang, Yu and Kwok, James T and Li, Zhenguo and Weller, Adrian and Liu, Weiyang},
journal={arXiv preprint arXiv:2309.12284},
year={2023}
}
```
Provider:
maas
Created:
2025-02-09
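Background overview
The GSM8K Chinese dataset targets mathematical reasoning in Chinese. It was translated from the English GSM8K dataset with GPT-3.5-Turbo and contains 7473 training samples and 1319 testing samples. Training samples provide Chinese questions and answers and are suited to supervised fine-tuning; testing samples provide only questions and are used for model evaluation. The dataset supports research on Chinese-language processing for mathematical reasoning tasks.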



