MU-NLPC/Calc-ape210k

Name: MU-NLPC/Calc-ape210k
Creator: MU-NLPC
Published: 2024-01-22 16:21:58
License: MIT (per the dataset card)

Source: Hugging Face. Updated 2024-01-22; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/MU-NLPC/Calc-ape210k

Resource description:

---
license: mit
dataset_info:
- config_name: default
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: question_chinese
    dtype: string
  - name: chain
    dtype: string
  - name: result
    dtype: string
  - name: result_float
    dtype: float64
  - name: equation
    dtype: string
  splits:
  - name: test
    num_bytes: 1153807
    num_examples: 1785
  - name: train
    num_bytes: 111628273
    num_examples: 195179
  - name: validation
    num_bytes: 1169676
    num_examples: 1783
  download_size: 50706818
  dataset_size: 113951756
- config_name: original-splits
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: question_chinese
    dtype: string
  - name: chain
    dtype: string
  - name: result
    dtype: string
  - name: result_float
    dtype: float64
  - name: equation
    dtype: string
  splits:
  - name: test
    num_bytes: 2784396
    num_examples: 4867
  - name: train
    num_bytes: 111628273
    num_examples: 195179
  - name: validation
    num_bytes: 2789481
    num_examples: 4867
  download_size: 52107586
  dataset_size: 117202150
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
- config_name: original-splits
  data_files:
  - split: test
    path: original-splits/test-*
  - split: train
    path: original-splits/train-*
  - split: validation
    path: original-splits/validation-*
---

# Dataset Card for Calc-ape210k

## Summary

This dataset is an instance of the Ape210K dataset, converted to a simple HTML-like language that can be easily parsed (e.g. by BeautifulSoup). The data contains 3 types of tags:

- gadget: A tag whose content is intended to be evaluated by calling an external tool (a sympy-based calculator in this case)
- output: An output of the external tool
- result: The final answer to the mathematical problem (a number)

## Supported Tasks

The dataset is intended for training Chain-of-Thought reasoning **models able to use external tools** to enhance the factuality of their responses.
This dataset presents in-context scenarios where models can outsource the computations in the reasoning chain to a calculator.

## Construction Process

First, we translated the questions into English using Google Translate. Next, we parsed the equations and the results. We linearized the equations into a sequence of elementary steps and evaluated them using a sympy-based calculator. We numerically compared the output with the result in the data and removed all examples where they did not match (less than 3% loss in each split). Finally, we saved the chain of steps in the HTML-like language in the `chain` column. We keep the original columns in the dataset for convenience. We also perform in-dataset and cross-dataset data-leak detection within the [Calc-X collection](https://huggingface.co/collections/MU-NLPC/calc-x-652fee9a6b838fd820055483). Specifically for Ape210k, we removed parts of the validation and test splits, with around 1700 examples remaining in each.

You can read more information about this process in our [Calc-X paper](https://arxiv.org/abs/2305.15017).

## Data splits

The default config contains filtered splits with data leaks removed.
You can load it using:

```python
import datasets

datasets.load_dataset("MU-NLPC/calc-ape210k")
```

In the `original-splits` config, the data splits are unfiltered and correspond to the original Ape210K dataset. See [ape210k dataset github](https://github.com/Chenny0808/ape210k) and [the paper](https://arxiv.org/abs/2009.11506) for more info.
You can load it using:

```python
import datasets

datasets.load_dataset("MU-NLPC/calc-ape210k", "original-splits")
```

## Attributes

- **id** - id of the example
- **question** - the description of the math problem. Automatically translated from the `question_chinese` column into English using Google Translate
- **question_chinese** - the original description of the math problem in Chinese
- **chain** - linearized `equation`, a sequence of arithmetic steps in the HTML-like language that can be evaluated using our sympy-based calculator
- **result** - result as a string (can be an integer, float, or a fraction)
- **result_float** - result, converted to a float
- **equation** - a nested expression that evaluates to the correct answer

Attributes **id**, **question**, **chain**, and **result** are present in all datasets in the [Calc-X collection](https://huggingface.co/collections/MU-NLPC/calc-x-652fee9a6b838fd820055483).

## Related work

This dataset was created as a part of a larger effort in training models capable of using a calculator during inference, which we call Calcformers.

- [**Calc-X collection**](https://huggingface.co/collections/MU-NLPC/calc-x-652fee9a6b838fd820055483) - datasets for training Calcformers
- [**Calcformers collection**](https://huggingface.co/collections/MU-NLPC/calcformers-65367392badc497807b3caf5) - calculator-using models we trained and published on HF
- [**Calc-X and Calcformers paper**](https://arxiv.org/abs/2305.15017)
- [**Calc-X and Calcformers repo**](https://github.com/prompteus/calc-x)

Here are links to the original dataset:

- [**original Ape210k dataset and repo**](https://github.com/Chenny0808/ape210k)
- [**original Ape210k paper**](https://arxiv.org/abs/2009.11506)

## Licence

MIT, consistently with the original dataset.

## Cite

If you use this version of the dataset in research, please cite the [original Ape210k paper](https://arxiv.org/abs/2009.11506), and the [Calc-X paper](https://arxiv.org/abs/2305.15017) as follows:

```bibtex
@inproceedings{kadlcik-etal-2023-soft,
    title = "Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through Interaction with Symbolic Systems",
    author = "Marek Kadlčík and Michal Štefánik and Ondřej Sotolář and Vlastimil Martinek",
    booktitle = "Proceedings of the The 2023 Conference on Empirical Methods in Natural Language Processing: Main track",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2305.15017",
}
```


Provider:

MU-NLPC

Summary of original information

Dataset overview

Dataset name

Name: Calc-ape210k

Dataset configurations

Configuration names: default, original-splits

Dataset features

id: string
question: string
question_chinese: string
chain: string
result: string
result_float: float64
equation: string

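Dataset splits

Default configuration (default)
- Test split: 1785 examples, 1153807 bytes
- Training split: 195179 examples, 111628273 bytes
- Validation split: 1783 examples, 1169676 bytes
- Download size: 50706818 bytes
- Dataset size: 113951756 bytes
Original splits configuration (original-splits)
- Test split: 4867 examples, 2784396 bytes
- Training split: 195179 examples, 111628273 bytes
- Validation split: 4867 examples, 2789481 bytes
- Download size: 52107586 bytes
- Dataset size: 117202150 bytes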
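
The split sizes above make it easy to see how much of each split the data-leak filtering removed. The following is a small sketch using only the numbers listed here; the dictionary and function names are illustrative and not part of the dataset or any library.

```python
# Split sizes copied from the dataset metadata listed above.
ORIGINAL_SPLITS = {"train": 195179, "validation": 4867, "test": 4867}
DEFAULT_SPLITS = {"train": 195179, "validation": 1783, "test": 1785}


def retained_fraction(split: str) -> float:
    """Share of the original-splits examples that remain in the default config."""
    return DEFAULT_SPLITS[split] / ORIGINAL_SPLITS[split]


for split in ORIGINAL_SPLITS:
    removed = ORIGINAL_SPLITS[split] - DEFAULT_SPLITS[split]
    print(f"{split}: removed {removed}, kept {retained_fraction(split):.1%}")
```

The training split is identical in both configurations; only the validation and test splits shrink.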

Dataset loading

Default configuration: datasets.load_dataset("MU-NLPC/calc-ape210k")
Original splits configuration: datasets.load_dataset("MU-NLPC/calc-ape210k", "original-splits")

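Dataset attributes

id: ID of the example
question: description of the math problem, translated into English from question_chinese
question_chinese: original description of the math problem, in Chinese
chain: linearized equation, a sequence of arithmetic steps in the HTML-like language
result: result as a string (integer, float, or fraction)
result_float: result, converted to a float
equation: nested expression that evaluates to the correct answer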
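
As an illustration of how `result` relates to `result_float`, the sketch below converts a result string to a float with Python's standard `fractions` module. It assumes that fractions appear in plain `a/b` form; the exact string format used in the dataset is not specified here, and `result_to_float` is a hypothetical helper name.

```python
from fractions import Fraction


def result_to_float(result: str) -> float:
    """Convert a result string (integer, decimal, or 'a/b' fraction) to a float.

    Assumption: fractions are written as 'a/b'; other notations would need
    additional handling.
    """
    text = result.strip().replace(" ", "")
    return float(Fraction(text))


print(result_to_float("3/4"))
print(result_to_float("12"))
```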

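License

License: MIT

Curated summary

Dataset introduction

Construction

The dataset was built by first translating the original math problems into English, then parsing the equations and results, linearizing each equation into a sequence of elementary steps, and evaluating those steps with a sympy-based calculator. The calculator output was compared numerically with the recorded result, and examples that did not match were removed. Finally, the chain of steps was saved in the HTML-like language in the `chain` column. Data-leak detection was also performed during construction, and for Ape210k parts of the validation and test splits were removed as a result.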
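
The linearize-and-check step can be sketched as follows. This is not the authors' implementation: it swaps the sympy-based calculator for Python's standard `ast` and `fractions` modules, assumes the equation is a plain arithmetic expression using `+`, `-`, `*`, `/` and parentheses, and uses an arbitrary comparison tolerance. The names `linearize` and `matches` are hypothetical.

```python
import ast
import math
import operator
from fractions import Fraction

# Supported binary operators: AST node type -> (symbol, function).
OPS = {
    ast.Add: ("+", operator.add),
    ast.Sub: ("-", operator.sub),
    ast.Mult: ("*", operator.mul),
    ast.Div: ("/", operator.truediv),
}


def linearize(expression: str):
    """Return (steps, value): one 'a op b' step per binary operation, innermost first."""
    steps = []

    def visit(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return Fraction(str(node.value))  # exact arithmetic, no float drift
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -visit(node.operand)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            left, right = visit(node.left), visit(node.right)
            symbol, func = OPS[type(node.op)]
            value = func(left, right)
            steps.append((f"{left} {symbol} {right}", value))
            return value
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")

    tree = ast.parse(expression, mode="eval")
    return steps, visit(tree.body)


def matches(value, result_float: float, rel_tol: float = 1e-6) -> bool:
    """Numerically compare the computed value with the recorded result (tolerance is an assumption)."""
    return math.isclose(float(value), result_float, rel_tol=rel_tol)


steps, value = linearize("(11-1)*2")
for step, output in steps:
    print(step, "=", output)
print("keep example:", matches(value, 20.0))
```

An example whose computed value does not match the recorded result would be dropped, which is the filtering the paragraph above describes.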

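Features

MU-NLPC/Calc-ape210k provides in-context scenarios for training Chain-of-Thought reasoning models that use external tools to improve the factuality of their responses. The data contains three types of tags: gadget (content to be evaluated by calling an external tool), output (the output of the external tool), and result (the final answer to the math problem). The dataset also went through data-leak detection and filtering.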
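
To show how the three tag types can be read back from a chain, here is a minimal parsing sketch. The dataset card mentions BeautifulSoup; this version uses the standard library's `html.parser` instead so that it has no dependencies. The sample chain string and the `id="calculator"` attribute are made up for illustration and may differ from real rows.

```python
from html.parser import HTMLParser


class ChainParser(HTMLParser):
    """Collect (tag, text) pairs for gadget, output, and result tags."""

    TAGS = {"gadget", "output", "result"}

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.items.append((tag, "".join(self._buffer).strip()))
            self._current = None


def parse_chain(chain: str):
    """Return the list of (tag, text) pairs found in a chain string."""
    parser = ChainParser()
    parser.feed(chain)
    parser.close()
    return parser.items


# Hypothetical chain, written by hand for this example.
sample_chain = (
    '<gadget id="calculator">11 - 1</gadget>\n'
    "<output>10</output>\n"
    '<gadget id="calculator">10 * 2</gadget>\n'
    "<output>20</output>\n"
    "<result>20</result>"
)
for tag, text in parse_chain(sample_chain):
    print(tag, "->", text)
```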

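Usage

The dataset can be loaded with the Hugging Face datasets library using either the default configuration or the original splits configuration. In the default configuration, data leaks have been removed; the original splits configuration keeps the splits of the original dataset. Users can choose whichever configuration suits their loading and training needs.

Background and challenges

Background overview

MU-NLPC/Calc-ape210k sits at the intersection of natural language processing and mathematical computation. It is intended for training Chain-of-Thought reasoning models that use external tools, such as a calculator, to improve the factuality of their answers. The dataset is based on Ape210K and was built by the MU-NLPC team in 2023 by translating, parsing, computing, and verifying the math problems and their answers. Its core research question is how to improve model accuracy on mathematical computation tasks.

Current challenges

Building the dataset required converting math problems into computation steps and ensuring that those steps execute correctly on the sympy-based calculator. It also required detecting and removing data leaks so that training and evaluation remain fair and valid. On the research side, the challenge the dataset addresses is how to make a model use an external computation tool accurately when solving complex math problems.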

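Common scenarios

Typical use case

MU-NLPC/Calc-ape210k is used at the intersection of natural language processing and mathematical reasoning. Its typical use case is training models that perform Chain-of-Thought reasoning with external tools, for example by calling a sympy-based calculator to improve the factuality of their responses. The dataset provides in-context scenarios in which a model can outsource the computations in its reasoning chain, which improves its ability to solve math problems.

Related work

Work related to MU-NLPC/Calc-ape210k includes the Calc-X collection, to which this dataset belongs, and the Calcformers collection of models that use a calculator during inference and were trained and published by the same authors.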

Recent research on the dataset