
gretel-math-gsm8k-v0

ModelScope community. Updated 2025-11-27; indexed 2025-05-24.
Download link:
https://modelscope.cn/datasets/gretelai/gretel-math-gsm8k-v0
Resource description:
# gretelai/gsm8k-synthetic-diverse-405b

This dataset is a synthetically generated dataset inspired by [GSM8K](https://huggingface.co/datasets/openai/gsm8k), created entirely using **Gretel Navigator with meta-llama/Meta-Llama-3.1-405B** as the agent LLM. It contains ~1500 grade-school-level math word problems with step-by-step solutions, with an emphasis on diversity of age group, difficulty, and domain.

## Key Features:

- Synthetically Generated: Math problems created using Gretel Navigator, employing evolutionary techniques, LLM-as-a-judge, and verification of annotated calculations via the `sympy` library.
- Stratified Test Set: 300 examples for test, the remainder for training, stratified by topic and difficulty.
- Diverse Contexts and Names: Problems feature a wide range of real-world contexts and include diverse names and ethnicities.
- Age Group Labeling: Each problem is tagged with an appropriate age group (grades 2 through 6).
- Difficulty Categorization: Problems are categorized as easy, medium, hard, or very hard.
- Expanded Domains: Covers a wide range of topics including basic algebra, geometry, and more.
- Step-by-Step Solutions: Clear reasoning with annotated arithmetic operations.
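The `sympy` verification of annotated calculations mentioned in the key features could look roughly like the sketch below. It assumes the solutions use GSM8K-style `<<expression=result>>` annotations, which this card does not confirm; the function name `check_annotations` is hypothetical, and this is not Gretel's actual pipeline.

```python
import re

from sympy import simplify, sympify
from sympy.core.sympify import SympifyError

# Assumption: calculations are annotated GSM8K-style, e.g. "<<48/2=24>>".
ANNOTATION = re.compile(r"<<([^<>=]+)=([^<>=]+)>>")


def check_annotations(solution: str):
    """Return (expression, claimed_result, is_correct) for each annotation."""
    results = []
    for expr, claimed in ANNOTATION.findall(solution):
        try:
            # Parse both sides symbolically and test whether they are equal.
            ok = simplify(sympify(expr) - sympify(claimed)) == 0
        except (SympifyError, TypeError):
            # Anything sympy cannot parse is treated as a failed check.
            ok = False
        results.append((expr.strip(), claimed.strip(), bool(ok)))
    return results


if __name__ == "__main__":
    text = "Half of 48 is <<48/2=24>>24, and three boxes hold <<3*24=70>>70."
    for expr, claimed, ok in check_annotations(text):
        print(f"{expr} = {claimed}: {'ok' in str(ok).lower() or ok}")
```

A solution would pass only if every tuple reports `True`; rows with a failing annotation could then be regenerated or dropped.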
## Dataset Statistics and Distribution

![meta-llama/Meta-Llama-3.1-405B Dataset Distribution](images/gsm8k-synthetic-diverse-405b_analysis.png)

## Gretel Navigator (selected model: meta-llama/Meta-Llama-3.1-405B) Dataset - Distribution Analysis

### Topic Distribution

| topic                    |   Train |   Test |
|:-------------------------|--------:|-------:|
| algebra                  |      25 |     20 |
| arithmetic               |      31 |     25 |
| compound interest        |      26 |     21 |
| data interpretation      |      27 |     20 |
| exponential growth/decay |      25 |     21 |
| fractions                |      29 |     24 |
| geometry                 |      35 |     29 |
| optimization             |      23 |     19 |
| percentages              |      37 |     29 |
| polynomials              |      21 |     18 |
| probability              |      20 |     17 |
| proportions              |      30 |     24 |
| ratios                   |      41 |     33 |

### Difficulty Distribution

| difficulty   |   Train |   Test |
|:-------------|--------:|-------:|
| easy         |      93 |     75 |
| hard         |      82 |     67 |
| medium       |     101 |     83 |
| very hard    |      94 |     75 |

## Citation and Usage

If you use this dataset in your research or applications, please cite it as:

```
@dataset{gretelai_gsm8k_synthetic,
  author = {Gretel AI},
  title = {Synthetically Generated Math Word Problems Dataset (gsm8k) with enhanced diversity using Gretel Navigator and meta-llama/Meta-Llama-3.1-405B},
  year = {2024},
  month = {9},
  publisher = {Gretel},
  howpublished = {https://huggingface.co/gretelai/gsm8k-synthetic-diverse-405b},
}
```

For questions, issues, or additional information, please visit the dataset repository on Hugging Face or contact Gretel AI.
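As an illustration of the split stratified by topic and difficulty that the distribution tables reflect, here is a minimal standard-library sketch. The field names `topic` and `difficulty`, the function name `stratified_split`, and the per-stratum rounding rule are assumptions for illustration; the card does not describe how Gretel actually produced the split.

```python
import random
from collections import defaultdict


def stratified_split(records, test_fraction, keys=("topic", "difficulty"), seed=0):
    """Split records so each (topic, difficulty) stratum keeps a similar test share."""
    rng = random.Random(seed)

    # Group records by their stratum key, e.g. ("ratios", "hard").
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[k] for k in keys)].append(rec)

    train, test = [], []
    for key in sorted(strata):  # sorted for a deterministic order
        group = strata[key]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


if __name__ == "__main__":
    # Toy data: 2 topics x 2 difficulties, 10 problems each.
    toy = [
        {"id": i, "topic": t, "difficulty": d}
        for i, (t, d) in enumerate(
            (t, d)
            for t in ("ratios", "geometry")
            for d in ("easy", "hard")
            for _ in range(10)
        )
    ]
    train, test = stratified_split(toy, test_fraction=0.3)
    print(len(train), len(test))
```

The tables above sum to 370 training and 300 test rows, so a comparable split on data of that size would use a test fraction of roughly 0.45.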

Provider:
maas
Created:
2025-05-20