
huangyt/FINETUNE4

Source: Hugging Face · Updated 2023-09-16 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/huangyt/FINETUNE4
Resource description:
---
license: openrail
---

![Change can be sunshine if you let it in..png](https://cdn-uploads.huggingface.co/production/uploads/64c7bfe8ac1016256b69ea02/r9ZWYaWBovYF7HafTEMVb.png)

# 📔 **DATASET**

| **Dataset** | Class | Number of Questions |
| ------- | ----------------------------------------------------------------- | ------------------------ |
| **FLAN_CoT(zs)** | Reasoning, MATH, ScienceQA, Commonsense | 8000 |
| **Prm800k** | Reasoning, MATH | 6713 |
| **ScienceQA** | ScienceQA | 5177 |
| **SciBench** | ScienceQA | 695 |
| **ReClor** | Reasoning | 1624 |
| **TheoremQA** | Commonsense, MATH, ScienceQA | 800 |
| **OpenBookQA** | Text_Understanding, Reasoning, Commonsense, ScienceQA | 5957 |
| **ARB** | Reasoning, MATH, ScienceQA, Commonsense, Text_Understanding | 605 |
| **Openassistant-guanaco** | Commonsense, Text_Understanding, Reasoning | 802 |
| **SAT** | Text_Understanding, Reasoning, MATH | 426 |
| **GRE, GMAT** | Reasoning, MATH | 254 |
| **AMC, AIME** | Reasoning, MATH | 1000 |
| **LSAT** | Reasoning, LAW | 1009 |
| **Gaokao-biology** | Comprehensive | 210 |
| **Gaokao-chemistry** | Comprehensive | 207 |
| **Gaokao-chinese** | Comprehensive | 246 |
| **Gaokao-english** | Comprehensive | 306 |
| **Gaokao-geography** | Comprehensive | 199 |
| **Gaokao-mathcloze** | Comprehensive | 118 |
| **Gaokao-mathqa** | Comprehensive | 351 |
| **Gaokao-physics** | Comprehensive | 200 |
| **LogiQA** | Reasoning | 651 |
| **LeetCode** | Reasoning, Code | 2359 |

# 📌 **Method**

## *Improving the dataset*

Based on the "Textbooks Are All You Need" paper, we want to try fine-tuning on advanced questions.

## *Dataset Format Definition*

We use the "instruction, input, output" format, which is typical of guided (instruction-following) datasets. In this format, each sample includes an instruction, an input, and an expected output. The instruction provides guidance on how to process the input to generate the output. This format is often used to train models to perform specific tasks, because it explicitly indicates the operation the model should perform.

```
{
    "input": "",
    "output": "",
    "instruction": ""
}
```

- ### [FLAN_V2 COT(ZS)](https://huggingface.co/datasets/conceptofmind/cot_submix_original/tree/main)

  We extract only the 'zs_opt' samples from COT and categorize each task.

- ### SAT, GRE, GMAT, AMC, AIME, LSAT

  For datasets such as GRE, GMAT, and SAT, we set the input to "Please read the question and options carefully, then select the most appropriate answer and provide the corresponding explanation." For the math datasets, the input is set to "Please provide the answer along with a corresponding explanation based on the given question."

  The questions are arranged in ascending order of difficulty. This follows the ORCA paper, in which training started with GPT-3.5 and later transitioned to GPT-4: to keep the student model from being given knowledge beyond its scope, which would lead to suboptimal results, a progressive learning strategy was used. Because this approach was found to be effective, I have arranged datasets such as AMC and AIME, which have several difficulty levels, in the same gradual, progressive order.

  In addition, the question and options are combined to form the instruction, and the label and solution are merged to become the output.

  Finally, because the LSAT dataset does not involve step-by-step solutions, the passage becomes the instruction, the combination of the question and options serves as the input, and the label is the output.

- ### Gaokao

  For most subsets we set the input to: "Please read and understand the requirements and content of the question carefully, and then choose the option that best fits the description of the question or best answers the question from the options provided."

  Only gaokao-mathcloze uses a different input: "Please read and comprehend the requirements and content of the question carefully. Gradually ponder upon it and present the most appropriate answer based on your judgment."

- ### LeetCode

  Input configuration: "Analyze the problem description and constraints, then develop a step-by-step Python function to generate the expected output based on the given inputs. Include brief explanations at each step to illustrate your solution process."

- ### LogiQA

  Only a general format conversion is performed.

- ### [OTHER](https://github.com/arielnlee/Platypus/tree/main/data_pipeline)

  The Prm800k, ScienceQA, SciBench, ReClor, TheoremQA, OpenBookQA, ARB, and OpenAssistant-Guanaco datasets adopt the same format as Platypus.

## *Sampling Algorithms*

The flan_v2 cot dataset includes tasks such as:

- cot_esnli
- cot_strategyqa
- cot_qasc
- stream_qed
- cot_gsm8k
- cot_ecqa
- cot_creak
- stream_aqua

To ensure this dataset contains diverse, high-quality data, we first select the zs_opt questions. Then we keep only the questions whose output length is at least the average output length, filtering out the shorter ones. This step aims to help the model learn richer reasoning steps. After that, we perform stratified sampling. Initially, we attempted stratified sampling before the length-based filtering, but this resulted in varying sample sizes, which made the result hard to reproduce. We therefore filter by length first and then perform stratified sampling.
```py
import json
import random

with open("cot_ORIGINAL.json", "r") as f:
    abc = json.load(f)

# --- part1 ---
# Keep only the samples built from the "zs_opt" template.
zsopt_data = []
for i in abc:
    if i["template_type"] == "zs_opt":
        zsopt_data.append(i)

# --- part2 ---
output_lengths = [len(i["targets"]) for i in zsopt_data]
average_length = sum(output_lengths) / len(output_lengths)  # average length

# Keep samples whose output length is >= average_length.
filtered_data = []
for a in zsopt_data:
    if len(a["targets"]) >= average_length:
        filtered_data.append(a)

# Count the number of samples for each class.
class_counts = {}
for a in filtered_data:
    task_name = a["task_name"]
    if task_name in class_counts:
        class_counts[task_name] += 1
    else:
        class_counts[task_name] = 1

# --- part3 ---
total_samples = 8000  # we plan to select a total of 8000 samples

# Share of each class in the filtered data.
sample_ratios = {}
for task_name, count in class_counts.items():
    sample_ratios[task_name] = count / len(filtered_data)

# Per-class sample size; because of rounding, the sizes may not sum to exactly total_samples.
sample_sizes = {}
for task_name, sample_ratio in sample_ratios.items():
    sample_sizes[task_name] = round(sample_ratio * total_samples)

# Perform stratified sampling for each class.
stratified_samples = {}
for task_name, sample_size in sample_sizes.items():
    class_samples = []
    for data in filtered_data:
        if data["task_name"] == task_name:
            class_samples.append(data)

    selected_samples = random.sample(class_samples, sample_size)
    stratified_samples[task_name] = selected_samples

# Convert to the specified format.
final_samples = []
for task_name, samples in stratified_samples.items():
    for sample in samples:
        final_samples.append(
            {
                "input": "",  # use ""
                "output": sample["targets"],  # output
                "instruction": sample["inputs"],  # question
            }
        )

with open("cot_change.json", "w") as f:
    json.dump(final_samples, f, indent=2)
```

Math questions (`math-json.json`) arranged according to level:

```py
import json

with open("math-json.json", "r", encoding="utf-8") as f:
    data_list = json.load(f)

# Sort the questions by difficulty level, ascending.
sorted_data = sorted(data_list, key=lambda x: x["other"]["level"])

# The question becomes the instruction; answer and solution are merged into the output.
output_data = [
    {
        "input": "Please provide the answer along with a corresponding explanation based on the given question.",
        "output": f"{item['answer']},solution:{item['other']['solution']}",
        "instruction": item["question"],
    }
    for item in sorted_data
]

with open("math_convert.json", "w", encoding="utf-8") as output_file:
    json.dump(output_data, output_file, ensure_ascii=False, indent=4)
```
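
The card describes the SAT/GRE/GMAT and LSAT conversions only in prose. The following is a minimal sketch of how those two mappings might look, not the authors' actual script. The raw field names (`passage`, `question`, `options`, `label`, `solution`) are hypothetical, and joining label and solution with `,solution:` is an assumption that mirrors the math script above.

```python
import json

# Input prompt quoted from the SAT/GRE/GMAT description above.
CHOICE_PROMPT = (
    "Please read the question and options carefully, then select the most "
    "appropriate answer and provide the corresponding explanation."
)


def join_question_and_options(question, options):
    """Return the question followed by one option per line."""
    return "\n".join([question, *options])


def convert_choice_item(item):
    """SAT/GRE/GMAT style: question + options -> instruction, label + solution -> output.

    The raw keys used here are hypothetical; adapt them to the real files.
    """
    return {
        "input": CHOICE_PROMPT,
        "output": f"{item['label']},solution:{item['solution']}",
        "instruction": join_question_and_options(item["question"], item["options"]),
    }


def convert_lsat_item(item):
    """LSAT style: passage -> instruction, question + options -> input, label -> output."""
    return {
        "input": join_question_and_options(item["question"], item["options"]),
        "output": item["label"],
        "instruction": item["passage"],
    }


if __name__ == "__main__":
    # Made-up record, used only to show the shape of the result.
    raw = {
        "passage": "Some passage text.",
        "question": "Which statement follows?",
        "options": ["(A) First", "(B) Second"],
        "label": "B",
        "solution": "The passage supports the second option.",
    }
    print(json.dumps(convert_lsat_item(raw), ensure_ascii=False, indent=2))
    print(json.dumps(convert_choice_item(raw), ensure_ascii=False, indent=2))
```

Keeping each mapping in its own small function makes it easy to check that every converted record has exactly the three keys of the format definition.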
Provided by:
huangyt
