gsm8k-reasoning-paths-combined

Hugging Face · Updated 2024-12-20 · Indexed 2024-12-21

Download link:

https://huggingface.co/datasets/gabrielmbmb/gsm8k-reasoning-paths-combined

Resource overview:

This dataset contains a `pipeline.yaml` file that can be used with the `distilabel` CLI to reproduce the pipeline that generated it. Its features include 'instruction', 'answer', 'generation', and 'distilabel_metadata'; the 'distilabel_metadata' feature records the raw input and output texts together with token usage statistics. The dataset has a single 'train' split with 100 samples and is tagged 'synthetic', 'distilabel', and 'rlaif'.

Created:

2024-12-20

Summary of original information

Dataset overview

Dataset name

gsm8k-reasoning-paths-combined

Dataset size

Download size: 443656 bytes
Dataset size: 1915207 bytes

Dataset structure

Features

instruction: string
answer: string
generation: sequence of strings
distilabel_metadata: list
- raw_input_text_generation_0: list
  - content: string
  - role: string
- raw_output_text_generation_0: string
- statistics_text_generation_0: struct
  - input_tokens: integer (int64)
  - output_tokens: integer (int64)
model_name: string
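
The feature list above can be written down as Python type hints, which may help when handling records in code. This is only a sketch: the class names `ChatMessage`, `TokenStatistics`, `GenerationMetadata`, and `Record`, and the helper `missing_fields`, are hypothetical names chosen for illustration; only the field names come from the dataset.

```python
from typing import List, TypedDict


class ChatMessage(TypedDict):
    content: str
    role: str


class TokenStatistics(TypedDict):
    input_tokens: int
    output_tokens: int


class GenerationMetadata(TypedDict):
    raw_input_text_generation_0: List[ChatMessage]
    raw_output_text_generation_0: str
    statistics_text_generation_0: TokenStatistics


class Record(TypedDict):
    instruction: str
    answer: str
    generation: List[str]
    distilabel_metadata: List[GenerationMetadata]
    model_name: str


def missing_fields(record: dict) -> List[str]:
    """Return the top-level field names that are absent from a record."""
    return [name for name in Record.__annotations__ if name not in record]
```

A check such as `missing_fields(row)` returns an empty list when a row carries all five top-level fields.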

Splits

train: 100 samples, 1915207 bytes

Configurations

config_name: default
- data_files:
  - split: train
  - path: data/train-*

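Dataset summary

This dataset contains a pipeline.yaml file that can be used with the distilabel CLI to reproduce the pipeline that generated it.

Example record structure

Examples in each configuration have the following structure:

Configuration: default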

{
    "answer": "Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\nShe makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.\n#### 18",
    "distilabel_metadata": [
        {
            "raw_input_text_generation_0": [
                {
                    "content": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
                    "role": "user"
                }
            ],
            "raw_output_text_generation_0": "## Step 1: Calculate the total number of eggs laid by the ducks per day\nJanet's ducks lay 16 eggs per day.\n\n## Step 2: Calculate the number of eggs Janet eats for breakfast\nJanet eats 3 eggs for breakfast every morning.\n\n## Step 3: Calculate the number of eggs Janet uses to bake muffins\nJanet bakes muffins with 4 eggs every day.\n\n## Step 4: Calculate the total number of eggs used\nTotal eggs used = 3 (breakfast) + 4 (muffins) = 7 eggs.\n\n## Step 5: Calculate the number of eggs left to sell at the farmers' market\nEggs left to sell = Total eggs laid - Total eggs used = 16 - 7 = 9 eggs.\n\n## Step 6: Calculate the total amount of money Janet makes every day at the farmers' market\nJanet sells the eggs at $2 per egg, so the total amount of money she makes = 9 eggs * $2/egg = $18.\n\nThe final answer is: $\\boxed{18}$",
            "statistics_text_generation_0": {
                "input_tokens": 100,
                "output_tokens": 229
            }
        },
        ...
    ],
    "generation": [
        "## Step 1: Calculate the total number of eggs laid by the ducks per day\nJanet's ducks lay 16 eggs per day.\n\n## Step 2: Calculate the number of eggs Janet eats for breakfast\nJanet eats 3 eggs for breakfast every morning.\n\n## Step 3: Calculate the number of eggs Janet uses to bake muffins\nJanet bakes muffins with 4 eggs every day.\n\n## Step 4: Calculate the total number of eggs used\nTotal eggs used = 3 (breakfast) + 4 (muffins) = 7 eggs.\n\n## Step 5: Calculate the number of eggs left to sell at the farmers' market\nEggs left to sell = Total eggs laid - Total eggs used = 16 - 7 = 9 eggs.\n\n## Step 6: Calculate the total amount of money Janet makes every day at the farmers' market\nJanet sells the eggs at $2 per egg, so the total amount of money she makes = 9 eggs * $2/egg = $18.\n\nThe final answer is: $\\boxed{18}$",
        ...
    ]
}
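
One way to use a record like this is to compare each generated reasoning path against the reference answer. The sketch below assumes two conventions: that the reference `answer` ends with the final number (in GSM8K, after a `####` marker), and that each generation states its result as `\boxed{...}`, as in the example. The function names are hypothetical and not part of `distilabel`.

```python
import re
from typing import List, Optional

_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def _to_number(text: str) -> Optional[float]:
    """Parse the first number in text, ignoring thousands separators."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


def reference_answer(answer: str) -> Optional[float]:
    """Final number of the reference: text after '####' if present, else the last line."""
    if "####" in answer:
        tail = answer.rsplit("####", 1)[1]
    else:
        lines = answer.strip().splitlines()
        tail = lines[-1] if lines else ""
    return _to_number(tail)


def generated_answer(generation: str) -> Optional[float]:
    """Number inside the last \\boxed{...} of a generated reasoning path."""
    boxed = _BOXED.findall(generation)
    return _to_number(boxed[-1]) if boxed else None


def score_paths(answer: str, generations: List[str]) -> List[bool]:
    """True for each generation whose boxed result equals the reference number."""
    expected = reference_answer(answer)
    return [expected is not None and generated_answer(g) == expected for g in generations]
```

For the record shown above, `score_paths(record["answer"], record["generation"])` would mark the displayed path as correct, since both sides resolve to 18.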

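Collected summary

Dataset introduction

Construction

The dataset was built with [distilabel](https://distilabel.argilla.io/) as a synthetic dataset of reasoning paths. Its construction is driven by the `pipeline.yaml` configuration file, which users can run with the `distilabel` CLI to reproduce the generation process. Each sample contains step-by-step reasoning that traces the path from the input question to the final answer, which supports reproducibility and makes the outputs easy to inspect.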

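Characteristics

The dataset's main characteristics are that it is synthetic and that its reasoning paths are detailed. Each sample contains not only the user input and the model-generated answer but also every step of the reasoning, which makes the data transparent and interpretable when training and evaluating models. The dataset also records input and output token statistics, which users can draw on for performance analysis and optimization.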
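
The token statistics can be aggregated with a few lines of Python. The following is a minimal sketch that assumes each record is a plain dict shaped like the example record, with a `distilabel_metadata` list whose entries carry `statistics_text_generation_0`; the function name `token_totals` is hypothetical.

```python
from typing import Dict, Iterable, List


def token_totals(records: Iterable[dict]) -> Dict[str, float]:
    """Sum input/output tokens over every metadata entry of every record."""
    input_tokens: List[int] = []
    output_tokens: List[int] = []
    for record in records:
        for entry in record.get("distilabel_metadata", []):
            stats = entry.get("statistics_text_generation_0", {})
            input_tokens.append(stats.get("input_tokens", 0))
            output_tokens.append(stats.get("output_tokens", 0))
    calls = len(output_tokens)
    return {
        "calls": calls,
        "input_tokens": sum(input_tokens),
        "output_tokens": sum(output_tokens),
        "mean_output_tokens": sum(output_tokens) / calls if calls else 0.0,
    }
```

Each metadata entry corresponds to one generation call, so `calls` counts generations rather than records.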

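Usage

Users can run the `pipeline.yaml` configuration file with the `distilabel` CLI to reproduce the generation process, or inspect the configuration with the `distilabel pipeline info` command. The dataset has a clear structure with fields such as `instruction`, `answer`, and `generation`, which can be used directly for model training or evaluation. The `distilabel_metadata` field additionally provides the raw inputs and outputs and the token statistics for more detailed analysis.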
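
To work with the fields directly, the `train` split can be loaded with the Hugging Face `datasets` library. The sketch below assumes `datasets` is installed and that the repository is reachable; `load_train_split` and `paths_per_example` are hypothetical helper names. The import is done lazily so the counting helper can also be tried on in-memory records without network access.

```python
from collections import Counter
from typing import Dict, Iterable

DATASET_ID = "gabrielmbmb/gsm8k-reasoning-paths-combined"


def load_train_split():
    """Download the 'train' split (needs `pip install datasets` and network access)."""
    from datasets import load_dataset  # lazy import: the helper below works offline

    return load_dataset(DATASET_ID, split="train")


def paths_per_example(records: Iterable[dict]) -> Dict[int, int]:
    """Map 'number of generated reasoning paths' to 'number of records with that many'."""
    return dict(Counter(len(record["generation"]) for record in records))
```

Calling `train = load_train_split()` and then `paths_per_example(train)` gives a quick view of how many reasoning paths each instruction carries; `train.column_names` lists the available fields.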

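Background and challenges

Background

The gsm8k-reasoning-paths-combined dataset was created with the distilabel tool and mainly supports reasoning-path generation tasks in natural language processing. Its core research question is how to produce high-quality answers through structured reasoning steps, particularly in mathematical and logical reasoning. By providing detailed reasoning steps alongside answers, the dataset aims to help researchers develop and evaluate models capable of multi-step reasoning. It was created on 2024-12-20, which makes it a recent addition to work on synthetic reasoning data.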

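Current challenges

The main challenge for this dataset is generating reasoning paths that are both accurate and logically clear. First, every reasoning step must be correct, which requires careful analysis of the input problem. Second, the generated paths must be interpretable, so that a model's reasoning process can be followed clearly. Finally, the dataset is small (only 100 samples), which may limit its usefulness for training complex models, especially when large amounts of data are needed for generalization.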

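Common scenarios

Typical use cases

Typical use of the gsm8k-reasoning-paths-combined dataset centers on mathematical reasoning and problem solving. By providing detailed step decompositions and reasoning paths, the dataset helps models understand and solve multi-step math problems. For example, the sample record shows how to compute, step by step, the amount Janet earns each day from selling duck eggs at the farmers' market: from the total number of eggs laid, through the eggs consumed, to the final revenue. Such detailed reasoning paths give models clear guidance for reasoning about and answering similar problems.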
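
The Janet example can also be checked mechanically. In GSM8K-style `answer` strings, each calculation is annotated as `<<expression=result>>`, for example `<<16-3-4=9>>`. The sketch below re-evaluates every such annotation with a small arithmetic evaluator restricted to `+`, `-`, `*`, and `/`. The function names are hypothetical, and the approach assumes the annotations contain only plain arithmetic.

```python
import ast
import operator
import re
from typing import List, Tuple

_ANNOTATION = re.compile(r"<<([^<>=]+)=([^<>]+)>>")
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST) -> float:
    """Evaluate a parsed arithmetic expression without using eval()."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return float(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")


def check_annotations(answer: str) -> List[Tuple[str, bool]]:
    """Return each '<<expr=result>>' annotation with whether the arithmetic holds."""
    results = []
    for expression, claimed in _ANNOTATION.findall(answer):
        try:
            value = _evaluate(ast.parse(expression.strip(), mode="eval"))
            ok = abs(value - float(claimed)) < 1e-9
        except (ValueError, SyntaxError, ZeroDivisionError):
            ok = False
        results.append((f"{expression}={claimed}", ok))
    return results
```

Applied to the Janet answer, this confirms both steps: 16 - 3 - 4 = 9 eggs sold and 9 * 2 = 18 dollars earned.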

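Academic problems addressed

The dataset targets a key problem in mathematical reasoning tasks: decomposing a complex problem into steps and reasoning through them logically. By providing detailed reasoning steps and paths, it helps models break a problem down systematically and solve it step by step. This can improve a model's reasoning ability and gives researchers a tool for evaluating and improving mathematical reasoning models. More broadly, it contributes to research on mathematical reasoning and offers a foundation for future intelligent education systems.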

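Derived and related work

The gsm8k-reasoning-paths-combined dataset can serve as a basis for related work, particularly in mathematical reasoning and natural language processing. For example, researchers can use it to develop more efficient mathematical reasoning models that improve their reasoning ability by learning from the dataset's reasoning paths. The dataset is also relevant to education technology, where it can inform intelligent education systems and automated tutoring tools. In academic settings, it can be used to evaluate and compare the performance of different reasoning models.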

The content above was collected and summarized by 遇见数据集.