gsm8k-synthetic-diverse-8b
Source: ModelScope community. Updated 2025-12-05; indexed 2025-05-24.
Download link: https://modelscope.cn/datasets/gretelai/gsm8k-synthetic-diverse-8b
# gretelai/gsm8k-synthetic-diverse-8b
This dataset is a synthetic dataset inspired by GSM8K (`https://huggingface.co/datasets/openai/gsm8k`), created entirely using **Gretel Navigator with meta-llama/Meta-Llama-3.1-8B** as the agent LLM. It contains 1,828 grade-school-level math word problems with step-by-step solutions (1,528 for training and 300 for testing, per the distribution tables), with an emphasis on diversity of age group, difficulty, and domain.
## Key Features:
- Synthetically Generated: Math problems created using Gretel Navigator, employing evolutionary techniques, LLM-as-a-judge, and verification of annotated calculations via the `sympy` library.
- Stratified Test Set: 300 examples are held out for testing and the rest are used for training; the split is stratified by topic and difficulty.
- Diverse Contexts and Names: Problems feature a wide range of real-world contexts and include diverse names and ethnicities.
- Age Group Labeling: Each problem is tagged with an appropriate age group (grades 2 through 6).
- Difficulty Categorization: Problems are categorized as easy, medium, or hard.
- Expanded Domains: Covers eight topics: arithmetic, basic algebra, data interpretation, fractions, geometry, percentages, ratios and proportions, and word problems.
- Step-by-Step Solutions: Clear reasoning with annotated arithmetic operations.
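
The `sympy` verification mentioned under "Synthetically Generated" can be pictured with a small sketch. It assumes the solutions use GSM8K-style calculator annotations of the form `<<expression=result>>`; this card does not spell out the annotation format or the actual verification code, so the regex, the function name `check_annotations`, and the sample solution text are illustrative assumptions only.

```python
import re

import sympy

# Assumed annotation format (GSM8K style): <<expression=result>>
ANNOTATION = re.compile(r"<<([^<>=]+)=([^<>=]+)>>")


def check_annotations(solution: str) -> list:
    """Return (expression, claimed_result, is_correct) for each annotation found."""
    results = []
    for expr, claimed in ANNOTATION.findall(solution):
        # rational=True turns decimals into exact rationals, avoiding float noise.
        # sympify evaluates its input, so use it only on trusted text.
        lhs = sympy.sympify(expr, rational=True)
        rhs = sympy.sympify(claimed, rational=True)
        results.append((expr, claimed, sympy.simplify(lhs - rhs) == 0))
    return results


# Hypothetical solution text, written for this example.
example = (
    "Maya sold 48 clips in April and half as many in May: <<48/2=24>>24. "
    "In total she sold <<48+24=72>>72 clips."
)

for expr, claimed, ok in check_annotations(example):
    print(f"{expr} = {claimed}: {'ok' if ok else 'MISMATCH'}")
```

A generation pipeline could discard or regenerate any problem for which at least one annotation is flagged as a mismatch.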
## Dataset Statistics and Distribution

## Gretel Navigator (selected model: meta-llama/Meta-Llama-3.1-8B) Dataset - Distribution Analysis
### Topic Distribution
| topic | Train | Test |
|:-----------------------|--------:|-------:|
| arithmetic | 193 | 38 |
| basic algebra | 179 | 35 |
| data interpretation | 202 | 40 |
| fractions | 181 | 35 |
| geometry | 171 | 33 |
| percentages | 203 | 41 |
| ratios and proportions | 201 | 39 |
| word problems | 198 | 39 |
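
As a quick sanity check on the stratification claim, the following sketch computes the held-out share for each topic from the counts in the table above. The counts are copied from the table; the variable and function names are illustrative.

```python
# Counts copied from the topic table: topic -> (train, test)
topic_counts = {
    "arithmetic": (193, 38),
    "basic algebra": (179, 35),
    "data interpretation": (202, 40),
    "fractions": (181, 35),
    "geometry": (171, 33),
    "percentages": (203, 41),
    "ratios and proportions": (201, 39),
    "word problems": (198, 39),
}


def held_out_share(train: int, test: int) -> float:
    """Fraction of a stratum that ended up in the test split."""
    return test / (train + test)


shares = {topic: held_out_share(*c) for topic, c in topic_counts.items()}
total_train = sum(train for train, _ in topic_counts.values())
total_test = sum(test for _, test in topic_counts.values())
overall_share = held_out_share(total_train, total_test)

for topic, share in shares.items():
    print(f"{topic:<24}{share:.1%}")
print(f"{'overall':<24}{overall_share:.1%}")
```

Every topic lands between 16% and 17%, close to the overall share of 300 out of 1,828, which is what a split stratified by topic should produce.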
### Difficulty Distribution
| difficulty | Train | Test |
|:-------------|--------:|-------:|
| easy | 531 | 104 |
| hard | 509 | 101 |
| medium | 488 | 95 |
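
The counts above are consistent with a split that holds out a similar share of every stratum. The sketch below shows one way such a split could be produced; it is not the procedure Gretel used, which this card does not describe. The field names `topic` and `difficulty`, the function `stratified_split`, and the toy rows are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, keys=("topic", "difficulty"), test_fraction=0.2, seed=0):
    """Split rows into (train, test), sampling test_fraction from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[k] for k in keys)].append(row)

    train, test = [], []
    # Sort strata so the result is reproducible for a given seed.
    for _, members in sorted(strata.items()):
        members = list(members)
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy data: 3 topics x 3 difficulties x 20 rows each (hypothetical).
toy_rows = [
    {"id": f"{t}-{d}-{i}", "topic": t, "difficulty": d}
    for t in ("fractions", "geometry", "percentages")
    for d in ("easy", "medium", "hard")
    for i in range(20)
]

toy_train, toy_test = stratified_split(toy_rows, test_fraction=0.2)
print(len(toy_train), len(toy_test))
```

Because rounding happens per stratum, a sketch like this does not guarantee an exact total such as 300; hitting a fixed test size would need an extra adjustment step.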
## Citation and Usage
If you use this dataset in your research or applications, please cite it as:
```
@dataset{gretelai_gsm8k_synthetic,
  author       = {Gretel AI},
  title        = {Synthetically Generated Math Word Problems Dataset (gsm8k) with enhanced diversity using Gretel Navigator and meta-llama/Meta-Llama-3.1-8B},
  year         = {2024},
  month        = {9},
  publisher    = {Gretel},
  howpublished = {https://huggingface.co/gretelai/gsm8k-synthetic-diverse-8b},
}
```
For questions, issues, or additional information, please visit the dataset repository on Hugging Face or contact Gretel AI.
Provider: maas
Created: 2025-05-20



