huangyt/FINETUNE3
Dataset Overview
Dataset List
| Dataset | Category | Number of questions |
|---|---|---|
| FLAN_CoT(zs) | Reasoning, MATH, ScienceQA, Commonsense | 8000 |
| Prm800k | Reasoning, MATH | 6713 |
| ScienceQA | ScienceQA | 5177 |
| SciBench | ScienceQA | 695 |
| ReClor | Reasoning | 1624 |
| TheoremQA | Commonsense, MATH, ScienceQA | 800 |
| OpenBookQA | Text_Understanding, Reasoning, Commonsense, ScienceQA | 5957 |
| ARB | Reasoning, MATH, ScienceQA, Commonsense, Text_Understanding | 605 |
| OpenAssistant-Guanaco | Commonsense, Text_Understanding, Reasoning | 802 |
| SAT | Text_Understanding, Reasoning, MATH | 426 |
| GRE, GMAT | Reasoning, MATH | 254 |
| AMC, AIME | Reasoning, MATH | 1000 |
| LSAT | Reasoning, LAW | 1009 |
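Dataset Format Definition
The dataset uses the "instruction, input, output" format: each sample consists of an instruction, an input, and an expected output. The instruction describes how the input should be processed to produce the output. Datasets in this format are commonly used to train models on specific tasks because they state explicitly what the model is expected to do.
Example format:
```json
{
  "input": "",
  "output": "",
  "instruction": ""
}
```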
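As a minimal sketch of how one record in this format could be assembled and checked, the helper below builds a sample and verifies that exactly the three expected keys are present. The names `make_sample` and `is_valid_sample` and the example question are hypothetical and are not part of the dataset tooling.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}


def make_sample(instruction: str, output: str, input_text: str = "") -> dict:
    """Build one record in the instruction/input/output format."""
    return {"input": input_text, "output": output, "instruction": instruction}


def is_valid_sample(sample: dict) -> bool:
    """Check that a record has exactly the three expected string fields."""
    return set(sample) == REQUIRED_KEYS and all(
        isinstance(value, str) for value in sample.values()
    )


example = make_sample(
    instruction="What is 2 + 3?",  # hypothetical question
    output="5",
)
print(json.dumps(example, ensure_ascii=False, indent=2))
```

Leaving `input` empty mirrors how the FLAN CoT samples are stored, where the whole question goes into `instruction`.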
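Dataset Configuration
- SAT, GRE, GMAT, AMC, AIME, LSAT:
  - For datasets such as GRE, GMAT, and SAT, the input is set to "Please read the question and options carefully, then select the most appropriate answer and provide the corresponding explanation."
  - For math questions, the input is set to "Please provide the answer along with a corresponding explanation based on the given question."
  - Questions are sorted in ascending order of difficulty level.
  - The LSAT dataset has no step-by-step solutions: the passage becomes the instruction, the question and options are combined into the input, and the label is the output.
- Other datasets:
  - Prm800k, ScienceQA, SciBench, ReClor, TheoremQA, OpenBookQA, ARB, and OpenAssistant-Guanaco use the same format as Platypus.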
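The LSAT mapping described above can be sketched as follows. This is only an illustration: the field names `passage`, `question`, `options`, and `label`, the function `convert_lsat_item`, and the sample record are assumptions, and the real source files may be structured differently.

```python
def convert_lsat_item(item: dict) -> dict:
    """Map one LSAT-style record to the instruction/input/output format.

    The keys "passage", "question", "options", and "label" are assumed
    field names used for illustration only.
    """
    options_text = "\n".join(item["options"])
    return {
        "instruction": item["passage"],  # passage becomes the instruction
        "input": f'{item["question"]}\n{options_text}',  # question + options
        "output": item["label"],  # label becomes the output
    }


raw = {
    "passage": "All birds in the reserve are tagged.",  # made-up example
    "question": "Which statement must be true?",
    "options": ["(A) Some birds are untagged.", "(B) Every bird is tagged."],
    "label": "B",
}
converted = convert_lsat_item(raw)
```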
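Sampling Algorithm
- Select the zs_opt questions from the flan_v2 cot dataset.
- Filter out questions whose output length is below the average length, keeping only those at or above the average, to help the model learn richer reasoning steps.
- After length filtering, perform stratified sampling by task so that each task is represented in proportion to its share of the filtered data.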
Example code for the sampling algorithm:
```python
import json
import random
from collections import Counter

with open("cot_ORIGINAL.json", "r", encoding="utf-8") as f:
    all_data = json.load(f)

# Keep only samples with the "zs_opt" template type.
zsopt_data = [d for d in all_data if d["template_type"] == "zs_opt"]

# Average output length (in characters) over the zs_opt samples.
average_length = sum(len(d["targets"]) for d in zsopt_data) / len(zsopt_data)

# Keep samples whose output length is >= the average length.
filtered_data = [d for d in zsopt_data if len(d["targets"]) >= average_length]

# Count the number of samples for each class (task).
class_counts = Counter(d["task_name"] for d in filtered_data)

total_samples = 8000  # we plan to select a total of 8000 samples

# Per-class sample size, proportional to the class share of the filtered data.
sample_sizes = {
    task_name: round(count / len(filtered_data) * total_samples)
    for task_name, count in class_counts.items()
}

# Perform stratified sampling for each class.
stratified_samples = {}
for task_name, sample_size in sample_sizes.items():
    class_samples = [d for d in filtered_data if d["task_name"] == task_name]
    stratified_samples[task_name] = random.sample(class_samples, sample_size)

# Convert to the specified format.
final_samples = [
    {
        "input": "",  # left empty
        "output": sample["targets"],  # output
        "instruction": sample["inputs"],  # question
    }
    for samples in stratified_samples.values()
    for sample in samples
]

with open("cot_change.json", "w", encoding="utf-8") as f:
    json.dump(final_samples, f, indent=2)
```
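Because each task's sample size is rounded independently, the per-task sizes are not guaranteed to add up to exactly the target total. One possible way to handle this, which is not taken from the original script, is largest-remainder allocation: take the floor of each proportional share, then hand the leftover slots to the tasks with the largest fractional parts. The function name `allocate_samples` and the toy counts below are hypothetical.

```python
import math


def allocate_samples(class_counts: dict, total_samples: int) -> dict:
    """Split total_samples across classes in proportion to class_counts.

    Uses the largest-remainder method so that the sizes sum exactly to
    total_samples. Assumes total_samples does not exceed the number of
    available items.
    """
    population = sum(class_counts.values())
    if total_samples > population:
        raise ValueError("cannot sample more items than are available")
    exact = {
        name: count * total_samples / population
        for name, count in class_counts.items()
    }
    sizes = {name: math.floor(value) for name, value in exact.items()}
    leftover = total_samples - sum(sizes.values())
    # Give the remaining slots to the classes with the largest fractional parts.
    by_remainder = sorted(
        exact, key=lambda name: exact[name] - sizes[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        sizes[name] += 1
    return sizes


toy_counts = {"task_a": 1, "task_b": 1, "task_c": 1}  # made-up counts
sizes = allocate_samples(toy_counts, 2)
naive_total = sum(round(count * 2 / 3) for count in toy_counts.values())
```

In this toy case, independent rounding would request three samples in total, while the allocation above returns sizes that sum to the requested two.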
Example code for sorting LSAT by level:
```python
import json

with open("math-json.json", "r", encoding="utf-8") as f:
    data_list = json.load(f)

# Sort in ascending order of difficulty level.
sorted_data = sorted(data_list, key=lambda x: x["other"]["level"])

output_data = [
    {
        "input": "Please provide the answer along with a corresponding explanation based on the given question.",
        "output": f"{item['answer']},solution:{item['other']['solution']}",
        "instruction": item["question"],
    }
    for item in sorted_data
]

with open("math_convert.json", "w", encoding="utf-8") as output_file:
    json.dump(output_data, output_file, ensure_ascii=False, indent=4)
```



