heegyu/orca-math-korean-preference-cleaned

Name: heegyu/orca-math-korean-preference-cleaned
Creator: heegyu
Published: 2024-07-18 05:55:26
License: not specified

Source: Hugging Face. Updated 2024-07-18; indexed 2024-07-22.

Download link:

https://hf-mirror.com/datasets/heegyu/orca-math-korean-preference-cleaned

Summary:

This dataset contains math problems and answers and is intended mainly for training and evaluating language models on math tasks. Its features include llm, question, answer, and others, and it has a single training split of 192,426 examples. Preprocessing removes whitespace indentation inside mathematical expressions and drops samples in which a specific number is generated repeatedly.

Provider:

heegyu

Original information summary

Dataset overview

Dataset information

Features:
- llm: string
- question: string
- answer: string
- question_en: string
- answer_en: string
- generated: string
- label: boolean
- chosen: string
- rejected: string
Splits:
- train:
  - Bytes: 1051241760
  - Examples: 192426
Download size: 386947470 bytes
Dataset size: 1051241760 bytes

Configuration

Configuration name: default
- Data files:
  - train: data/train-*
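Given the schema and the single `default` configuration above, loading the train split and turning a row into a preference triple might look like the following minimal sketch. It assumes the Hugging Face `datasets` library is installed and the Hub is reachable. `load_train_split` and `to_preference_triple` are hypothetical helper names, and using `question` as the prompt is an assumption, since the listing does not say which field pairs with `chosen` and `rejected`.

```python
def load_train_split():
    """Load the train split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the helper below also works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("heegyu/orca-math-korean-preference-cleaned", split="train")


def to_preference_triple(row):
    """Map one row to a prompt/chosen/rejected dict.

    Assumption: the Korean `question` field serves as the prompt.
    """
    return {
        "prompt": row["question"],
        "chosen": row["chosen"],
        "rejected": row["rejected"],
    }
```

Calling `to_preference_triple(load_train_split()[0])` would then return the first training pair in this shape.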

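Data preprocessing

Filtering steps:
1. Remove whitespace indentation in mathematical formulas
2. Remove data in which a specific number is generated repeatedly
Example problem: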

죽은 닭의 수 = 400의 40% = 0.40 * 400 = 160마리 닭

(English: Number of dead chickens = 40% of 400 = 0.40 * 400 = 160 chickens)
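As an illustration of the first filter, the sketch below normalizes a made-up, indented variant of the example line. The messy input string and the helper name `strip_formula_indentation` are hypothetical, and this line-preserving approach is an assumption chosen for illustration, not the author's code.

```python
import re


def strip_formula_indentation(text):
    """Trim each line, collapse inner runs of spaces/tabs, and skip blank lines."""
    cleaned = []
    for line in text.splitlines():
        line = re.sub(r"[ \t]+", " ", line.strip())
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)


# Hypothetical messy input built from the example line.
messy = "    죽은 닭의 수 = 400의 40%\n\n        = 0.40 * 400\n        = 160마리 닭"
print(strip_formula_indentation(messy))
```

Each derivation step ends up on its own unindented line, which keeps the arithmetic (0.40 * 400 = 160) readable while removing the layout noise.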

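Preprocessing code

Whitespace simplification: replace each run of consecutive whitespace with a single space, and remove leading whitespace on each line as well as empty lines.
Repetition pattern detection:
- Detect repetition of a single character (more than 50 times)
- Detect repetition of a specific n-gram (more than 4 times, n-gram size 3)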
Code example:

```python
from tqdm.auto import tqdm
from datasets import load_dataset, Dataset
import re
from collections import Counter


def simplify_whitespace(text):
    # Collapse every run of whitespace into one space. \s also matches
    # newlines, so line breaks are collapsed by this step as well.
    simplified = re.sub(r"\s+", " ", text)
    # Remove whitespace at the start of each line.
    simplified = re.sub(r"^\s+", "", simplified, flags=re.MULTILINE)
    # Remove empty lines.
    simplified = re.sub(r"\n\s*\n", "\n", simplified)
    return simplified.strip()


def has_repetition_patterns(text, char_repeat_threshold=50, ngram_repeat_threshold=4, ngram_size=3):
    # Flag text in which one character is followed by at least
    # `char_repeat_threshold` further copies of itself.
    char_pattern = r"(.)\1{" + str(char_repeat_threshold) + ",}"
    if re.search(char_pattern, text):
        return True
    return False


dataset = load_dataset("kuotient/orca-math-korean-preference", split="train")
new_items = []
for item in tqdm(dataset):
    item["question"] = simplify_whitespace(item["question"])
    item["chosen"] = simplify_whitespace(item["chosen"])

    q_repetite = has_repetition_patterns(item["question"])
    a_repetite = has_repetition_patterns(item["chosen"])

    # Keep only rows whose question and chosen answer show no repetition.
    if not q_repetite and not a_repetite:
        new_items.append(item)

new_ds = Dataset.from_list(new_items)
new_ds.push_to_hub("heegyu/orca-math-korean-preference-cleaned")
```
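The description lists an n-gram repetition check (n-gram size 3, more than 4 repeats), but the code shown only tests for repeated characters. One way such a check might be written is sketched below. Whitespace tokenization, counting repeats anywhere in the text rather than only consecutive ones, and the name `has_ngram_repetition` are all assumptions, not the author's implementation.

```python
from collections import Counter


def has_ngram_repetition(text, ngram_repeat_threshold=4, ngram_size=3):
    """Return True if any word n-gram occurs more than `ngram_repeat_threshold` times."""
    words = text.split()  # assumption: whitespace tokenization
    # Slide a window of `ngram_size` words across the text.
    ngrams = zip(*(words[i:] for i in range(ngram_size)))
    counts = Counter(ngrams)
    return any(count > ngram_repeat_threshold for count in counts.values())
```

With the default arguments, a trigram has to appear at least five times before a text is flagged, which matches the "more than 4 times" wording in the description.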
