shuyuej/metamath_gsm8k

Name: shuyuej/metamath_gsm8k
Creator: shuyuej
Published: 2024-01-25 19:44:59
License: apache-2.0 (as declared in the dataset card)

Hugging Face · Updated 2024-01-25 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/shuyuej/metamath_gsm8k


Resource description:

---
license: apache-2.0
---

# 🚀 GSM8K training set

The original answer ends with "\n#### Value"; it now ends with "\n#### Value\nThe answer is: Value". The answer text is also cleaned, so that it is **consistent with the answer format of "meta-math/MetaMathQA"**.

## 💻 Dataset Usage

Run the following command to load the data:

```python
from datasets import load_dataset

dataset = load_dataset("shuyuej/metamath_gsm8k")
dataset = dataset['train']
print(dataset)
```

# 📝 Dataset modification code

```python
# coding=utf-8

import re

import jsonlines
from datasets import load_dataset, Features, Value


def clean_up(sentence):
    # Find all the locations of "<<"
    matches = [match.start() for match in re.finditer(r'<<', sentence)]

    for match in matches:
        # Get the left 20 characters of each "<<"
        left_chars = sentence[match-20:match]
        # Replace "x" or "X" to "*" if they are in the left 20 characters
        modified_chars = sentence[match-20:match].replace('x', '*').replace('X', '*')

        # Modify the original sentence
        if 'x' in left_chars or 'X' in left_chars:
            sentence = sentence.replace(left_chars, modified_chars)

    ##############################################################################################################
    # Define a pattern to match text between "<< and >>"
    pattern = r"<<(.*?)>>"

    # Use re.sub to replace matched patterns with an empty string
    sentence = re.sub(pattern, "", sentence)

    ##############################################################################################################
    # Find all occurrences of "*"
    asterisks = [i for i, char in enumerate(sentence) if char == '*']

    # Check and add spaces around "*"
    for index in reversed(asterisks):
        if index > 0 and index < len(sentence) - 1 and sentence[index - 1] != ' ' and sentence[index + 1] != ' ':
            sentence = sentence[:index] + ' ' + sentence[index] + ' ' + sentence[index + 1:]
        elif index > 0 and index < len(sentence) - 1 and sentence[index - 1] != ' ' and sentence[index + 1] == ' ':
            sentence = sentence[:index] + ' ' + sentence[index] + sentence[index + 1:]
        elif index > 0 and index < len(sentence) - 1 and sentence[index - 1] == ' ' and sentence[index + 1] != ' ':
            sentence = sentence[:index] + sentence[index] + ' ' + sentence[index + 1:]

    ##############################################################################################################
    # # Find all occurrences of "/"
    # asterisks = [i for i, char in enumerate(sentence) if char == '/']
    #
    # # Check and add spaces around "/"
    # for index in reversed(asterisks):
    #     if index > 0 and index < len(sentence) - 1 and sentence[index - 1] != ' ' and sentence[index + 1] != ' ':
    #         sentence = sentence[:index] + ' ' + sentence[index] + ' ' + sentence[index + 1:]
    #     elif index > 0 and index < len(sentence) - 1 and sentence[index - 1] != ' ' and sentence[index + 1] == ' ':
    #         sentence = sentence[:index] + ' ' + sentence[index] + sentence[index + 1:]
    #     elif index > 0 and index < len(sentence) - 1 and sentence[index - 1] == ' ' and sentence[index + 1] != ' ':
    #         sentence = sentence[:index] + sentence[index] + ' ' + sentence[index + 1:]

    ##############################################################################################################
    # Find all occurrences of "+"
    asterisks = [i for i, char in enumerate(sentence) if char == '+']

    # Check and add spaces around "+"
    for index in reversed(asterisks):
        if index > 0 and index < len(sentence) - 1 and sentence[index - 1] != ' ' and sentence[index + 1] != ' ':
            sentence = sentence[:index] + ' ' + sentence[index] + ' ' + sentence[index + 1:]
        elif index > 0 and index < len(sentence) - 1 and sentence[index - 1] != ' ' and sentence[index + 1] == ' ':
            sentence = sentence[:index] + ' ' + sentence[index] + sentence[index + 1:]
        elif index > 0 and index < len(sentence) - 1 and sentence[index - 1] == ' ' and sentence[index + 1] != ' ':
            sentence = sentence[:index] + sentence[index] + ' ' + sentence[index + 1:]

    ##############################################################################################################
    # Find all occurrences of "-"
    asterisks = [i for i, char in enumerate(sentence) if char == '-']

    # Check and add spaces around "-"
    for index in reversed(asterisks):
        if index > 0 and index < len(sentence) - 1 and sentence[index - 1] != ' ' and sentence[index + 1] != ' ':
            sentence = sentence[:index] + ' ' + sentence[index] + ' ' + sentence[index + 1:]
        elif index > 0 and index < len(sentence) - 1 and sentence[index - 1] != ' ' and sentence[index + 1] == ' ':
            sentence = sentence[:index] + ' ' + sentence[index] + sentence[index + 1:]
        elif index > 0 and index < len(sentence) - 1 and sentence[index - 1] == ' ' and sentence[index + 1] != ' ':
            sentence = sentence[:index] + sentence[index] + ' ' + sentence[index + 1:]

    ##############################################################################################################
    # Find all occurrences of "="
    asterisks = [i for i, char in enumerate(sentence) if char == '=']

    # Check and add spaces around "="
    for index in reversed(asterisks):
        if index > 0 and index < len(sentence) - 1 and sentence[index - 1] != ' ' and sentence[index + 1] != ' ':
            sentence = sentence[:index] + ' ' + sentence[index] + ' ' + sentence[index + 1:]
        elif index > 0 and index < len(sentence) - 1 and sentence[index - 1] != ' ' and sentence[index + 1] == ' ':
            sentence = sentence[:index] + ' ' + sentence[index] + sentence[index + 1:]
        elif index > 0 and index < len(sentence) - 1 and sentence[index - 1] == ' ' and sentence[index + 1] != ' ':
            sentence = sentence[:index] + sentence[index] + ' ' + sentence[index + 1:]

    ##############################################################################################################
    # Find all occurrences of "."
    dots_locations = [match.start() for match in re.finditer(r'\.', sentence)]

    # Check and modify "." if the left side is space and the right side is a numerical number
    for dot_location in reversed(dots_locations):
        if sentence[dot_location - 1].isspace() and sentence[dot_location + 1].isdigit():
            sentence = sentence[:dot_location] + '0' + sentence[dot_location:]

    ##############################################################################################################
    # Check if there is a "." before "\n#### "
    if ".\n#### " not in sentence:
        # If not, add a "."
        sentence = sentence.replace("\n#### ", ".\n#### ")

    return sentence


# Retrieve the path of training and testing databases
context_feat = Features({"question": Value(dtype='string', id=None), "answer": Value(dtype='string', id=None)})
train_set = load_dataset('json', data_files='train.jsonl', split='train', features=context_feat)

data = []
for example in train_set:
    number = example['answer'].split('#### ')[1]
    number = int(number.replace(',', ''))
    append = "\nThe answer is: " + str(number)
    answer = example['answer'] + append
    answer = clean_up(sentence=answer)

    question = example['question']
    data.append({"question": question, "answer": answer})

# Save the modified data to a jsonl file
output_file = 'gsm8k_train.jsonl'
with jsonlines.open(output_file, 'w') as writer:
    writer.write_all(data)

print(f"Modified data saved to {output_file}")
```
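To make the format change concrete, here is a minimal sketch that applies only the suffix step to one hand-written answer string. The helper name `append_final_answer` and the sample text are hypothetical and do not come from the dataset script, and the sketch skips `clean_up` entirely.

```python
def append_final_answer(answer: str) -> str:
    """Append the "The answer is: Value" line to a GSM8K-style answer.

    Follows the same steps as the loop in the script above: take the text
    after "#### ", drop thousands separators, and parse it as an integer
    (a non-integer value would raise ValueError).
    """
    value = answer.split("#### ")[1]
    value = int(value.replace(",", ""))
    return answer + "\nThe answer is: " + str(value)


# Hypothetical sample in the GSM8K answer layout.
sample = "She sold 48 + 24 = 72 clips.\n#### 72"
print(repr(append_final_answer(sample)))
```

The printed string ends with both markers, so either "#### " or "The answer is: " can be used to locate the final value.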

Provider:

shuyuej


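Collected summary

Dataset introduction

Construction

High-quality data is a key ingredient in improving models for mathematical reasoning. The shuyuej/metamath_gsm8k dataset is built from the original GSM8K training set by cleaning and reformatting the answer field. The construction script uses regular expressions and string operations to remove the calculator annotations enclosed in "<<" and ">>" and to normalize the spacing around the operators "*", "+", "-", and "=" so that arithmetic expressions read cleanly. It also replaces "x" or "X" with "*" in the 20 characters preceding each annotation, inserts a leading zero when a decimal point follows whitespace and precedes a digit, ensures that a period precedes "\n#### ", and appends "The answer is: Value" to every answer, which matches the answer format of meta-math/MetaMathQA.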
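The annotation removal and operator spacing described above can be condensed into a few regular expressions. The following is a simplified sketch, not the dataset's own script: the function names are hypothetical, and edge cases (an operator at the very start or end of the string, for example) are not guaranteed to match the original character-by-character loops.

```python
import re


def strip_annotations(text: str) -> str:
    """Remove GSM8K calculator annotations such as <<48/2=24>>."""
    return re.sub(r"<<.*?>>", "", text)


def space_operators(text: str, operators: str = "*+-=") -> str:
    """Put a single space on any side of an operator that has none."""
    for op in operators:
        esc = re.escape(op)
        # Add a space before the operator when the previous character is not a space.
        text = re.sub(rf"(?<=[^ ]){esc}", f" {op}", text)
        # Add a space after the operator when the next character is not a space.
        text = re.sub(rf"{esc}(?=[^ ])", f"{op} ", text)
    return text


# Hypothetical sample line in the GSM8K answer style.
raw = "48/2=<<48/2=24>>24 clips, so 48+24=<<48+24=72>>72."
print(space_operators(strip_annotations(raw)))
# 48/2 = 24 clips, so 48 + 24 = 72.
```

As in the original script, "/" is left untouched, and every "-" is padded, including hyphens inside words.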

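Features

The dataset's defining feature is its uniform answer layout. After cleaning, every answer ends with "\n#### Value\nThe answer is: Value", which gives a model an unambiguous target for the final numeric result. The questions are the unmodified GSM8K grade-school word problems, covering basic arithmetic and multi-step reasoning; only the answers are changed, through symbol normalization and operator spacing that reduce formatting noise. This consistency makes the dataset suited to fine-tuning large language models on both solving the problems and formatting the final answer.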
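Because every answer ends with the same suffix, the final value can be read back with a single pattern. The sketch below assumes integer answers, as the construction script does; the helper name `extract_final_answer` and the sample string are hypothetical.

```python
import re
from typing import Optional

# The sign may be separated from the digits by a space, because the
# cleaning pass pads "-" with spaces.
FINAL_ANSWER = re.compile(r"The answer is:\s*(-?)\s*(\d+)\s*$")


def extract_final_answer(answer: str) -> Optional[int]:
    """Return the integer after "The answer is:", or None if the suffix is missing."""
    match = FINAL_ANSWER.search(answer)
    if match is None:
        return None
    sign, digits = match.groups()
    return int(sign + digits)


# Hypothetical answer in the cleaned layout.
sample = "In total she sold 48 + 24 = 72 clips.\n#### 72\nThe answer is: 72"
print(extract_final_answer(sample))
```

The same helper can be applied to model outputs during evaluation, provided the model was trained to emit this suffix.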

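Usage

The dataset can be loaded directly with the Hugging Face datasets library via load_dataset("shuyuej/metamath_gsm8k"). The loaded object exposes a train split, which can be indexed, batched, or iterated like any other datasets object. Because the answers are already cleaned and formatted, the records can be used for supervised fine-tuning without further preprocessing of the answer text.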
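For supervised fine-tuning, each record usually has to be turned into a prompt and a target. A minimal sketch follows; the prompt template and the function name `to_prompt_completion` are assumptions made for illustration, not a format prescribed by the dataset, and the sample record is hand-written.

```python
from typing import Dict

# Hypothetical template; replace it with whatever format the target model expects.
PROMPT_TEMPLATE = "### Question:\n{question}\n\n### Answer:\n"


def to_prompt_completion(record: Dict[str, str]) -> Dict[str, str]:
    """Map a {"question", "answer"} record to a prompt/completion pair."""
    return {
        "prompt": PROMPT_TEMPLATE.format(question=record["question"]),
        "completion": record["answer"],
    }


# Hand-written record with the same two fields as the dataset.
record = {
    "question": "Tom has 3 bags with 4 apples each. How many apples does he have?",
    "answer": "Tom has 3 * 4 = 12 apples.\n#### 12\nThe answer is: 12",
}
pair = to_prompt_completion(record)
print(pair["prompt"] + pair["completion"])
```

With the dataset loaded through `load_dataset`, calling `dataset.map(to_prompt_completion)` would add the two new columns to every record.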

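Background and challenges

Background overview

Mathematical reasoning is an important measure of the capability of large language models. shuyuej/metamath_gsm8k is built on GSM8K, which OpenAI released in 2021 to evaluate multi-step reasoning on grade-school math problems. The dataset cleans the original GSM8K training set and restructures its answers to follow the MetaMathQA answer format. It continues GSM8K's line of work on mathematical reasoning data by offering a more uniform, lower-noise version of the training set for fine-tuning.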

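Current challenges

The underlying task is multi-step mathematical problem solving: a model must understand a scenario described in natural language, carry out the arithmetic correctly, and produce a structured answer. The main construction challenge is cleaning and unifying the format. Original GSM8K answers mix 'x' and '*' as the multiplication sign, lack spaces around operators, and contain ambiguous number formats; these are handled with regular expressions and rule-based string edits so that the answers align with the MetaMathQA template. A further difficulty is preserving mathematical meaning during cleaning, so that symbol replacement or space insertion does not distort the reasoning.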
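The meaning-preservation concern can be made concrete with the 'x' to '*' replacement. A rule that rewrites every "x" in a fixed window before "<<" also touches ordinary words that fall inside the window. The sketch below contrasts a window rule written in the style of the dataset script with a narrower alternative that only rewrites an "x" standing between two digits. Both function names and the sample sentence are hypothetical, and the narrower rule is an illustration, not the method used to build the dataset.

```python
import re


def window_rule(sentence: str, width: int = 20) -> str:
    """Replace every x/X in the `width` characters before each "<<" with "*"."""
    out = sentence
    for m in re.finditer(r"<<", sentence):
        # max() guards against a negative slice start when "<<" occurs early;
        # this guard is an addition made for the sketch.
        start = max(0, m.start() - width)
        window = out[start:m.start()]
        out = out[:start] + window.replace("x", "*").replace("X", "*") + out[m.start():]
    return out


def digit_rule(sentence: str) -> str:
    """Replace x/X only when it stands between two digits (spaces allowed)."""
    return re.sub(r"(?<=\d)\s*[xX]\s*(?=\d)", " * ", sentence)


# Hypothetical sentence with a word containing "x" inside the window.
sample = "Each of the 3 boxes holds 6 x 4 = <<6*4=24>>24 toys."
print(window_rule(sample))  # Each of the 3 bo*es holds 6 * 4 = <<6*4=24>>24 toys.
print(digit_rule(sample))   # Each of the 3 boxes holds 6 * 4 = <<6*4=24>>24 toys.
```

The digit-based rule has its own blind spots (it would not catch "x" next to a variable name, for example), so a check of this kind is best run on the actual data before choosing a rule.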

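Common scenarios

Typical use

As a cleaned version of the GSM8K training set, shuyuej/metamath_gsm8k is typically used as training data for teaching large language models to solve math word problems. Its standardized answer format keeps the training targets consistent across multi-step reasoning, arithmetic, and final-answer formatting. Because it is derived from the training split, it serves as training data rather than as an evaluation benchmark.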

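Derived and related work

The most closely related work is the MetaMathQA project (meta-math/MetaMathQA), whose answer format this dataset adopts and which explores data augmentation for mathematical reasoning. Later research has extended mathematical reasoning toward theorem proving, integration with symbolic computation, and multimodal math problem solving.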

Recent research on the dataset