huangyt/FINETUNE2
Dataset Overview
Dataset List
| Dataset | Class | Number of Questions |
|---|---|---|
| FLAN_CoT(zs) | Reasoning, MATH, ScienceQA, Commonsense | 8000 |
| Prm800k | Reasoning, MATH | 6713 |
| ScienceQA | ScienceQA | 5177 |
| SciBench | ScienceQA | 695 |
| ReClor | Reasoning | 1624 |
| TheoremQA | Commonsense, MATH, ScienceQA | 800 |
| OpenBookQA | Text_Understanding, Reasoning, Commonsense, ScienceQA | 5957 |
| ARB | Reasoning, MATH, ScienceQA, Commonsense, Text_Understanding | 605 |
| Openassistant-guanaco | Commonsense, Text_Understanding, Reasoning | 802 |
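Dataset Format Definition
The dataset uses the "instruction, input, output" format: each sample contains an instruction, an input, and an expected output. This format is commonly used to train models on specific tasks because it states explicitly what the model should do.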
```json
{
  "input": "",
  "output": "",
  "instruction": ""
}
```
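As a minimal sketch of working with this format, the snippet below builds one record and checks that it has the three expected fields. The helper names `make_record` and `is_valid_record` and the example text are hypothetical and not part of the dataset tooling.

```python
import json


def make_record(instruction, output, input_text=""):
    """Build one sample in the instruction/input/output format."""
    return {"input": input_text, "output": output, "instruction": instruction}


def is_valid_record(record):
    """Check that a record has exactly the three expected string fields."""
    return (
        isinstance(record, dict)
        and set(record) == {"input", "output", "instruction"}
        and all(isinstance(value, str) for value in record.values())
    )


# Example values are invented for illustration only.
record = make_record(
    instruction="Is the following statement true? Water boils at 100 degrees Celsius at sea level.",
    output="Yes. At standard atmospheric pressure water boils at 100 degrees Celsius.",
)
serialized = json.dumps(record, ensure_ascii=False)
```

Leaving `input` empty by default mirrors the sampling code in this card, which places the whole question in `instruction`.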
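Sampling Algorithm
The FLAN_V2 COT dataset contains a variety of tasks, such as cot_esnli and cot_strategyqa. To ensure the dataset contains diverse, high-quality data, the zs_opt questions are selected first, then questions whose output is shorter than the average output length are filtered out, and finally stratified sampling by task is applied.
Sampling Steps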
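- Select the zs_opt questions:

  ```python
  # abc is assumed to be the already-loaded list of FLAN_V2 COT records;
  # it is not defined in this section.
  zsopt_data = []
  for i in abc:
      if i["template_type"] == "zs_opt":
          zsopt_data.append(i)
  ```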
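- Filter by output length:

  ```python
  # Compute the average output length over the zs_opt questions.
  output_lengths = [len(i["targets"]) for i in zsopt_data]
  average_length = sum(output_lengths) / len(output_lengths)

  # Keep only questions whose output is at least as long as the average.
  filtered_data = []
  for a in zsopt_data:
      if len(a["targets"]) >= average_length:
          filtered_data.append(a)
  ```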
- Stratified sampling:

  ```python
  import json
  import random

  # Count how many filtered samples belong to each task.
  class_counts = {}
  for a in filtered_data:
      task_name = a["task_name"]
      if task_name in class_counts:
          class_counts[task_name] += 1
      else:
          class_counts[task_name] = 1

  # Allocate the 8000-sample budget in proportion to each task's share.
  total_samples = 8000
  sample_ratios = {}
  for task_name, count in class_counts.items():
      sample_ratios[task_name] = count / len(filtered_data)

  sample_sizes = {}
  for task_name, sample_ratio in sample_ratios.items():
      sample_sizes[task_name] = round(sample_ratio * total_samples)

  # Draw the allocated number of samples from each task at random.
  stratified_samples = {}
  for task_name, sample_size in sample_sizes.items():
      class_samples = []
      for data in filtered_data:
          if data["task_name"] == task_name:
              class_samples.append(data)
      selected_samples = random.sample(class_samples, sample_size)
      stratified_samples[task_name] = selected_samples

  # Convert to the instruction/input/output format and write to disk.
  final_samples = []
  for task_name, samples in stratified_samples.items():
      for sample in samples:
          final_samples.append(
              {
                  "input": "",
                  "output": sample["targets"],
                  "instruction": sample["inputs"],
              }
          )

  with open("cot_change.json", "w") as f:
      json.dump(final_samples, f, indent=2)
  ```
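The three steps can also be condensed into a single function, which makes them easy to try on a small in-memory list. This is a sketch under stated assumptions, not the code used to build the dataset: the function name `sample_zs_opt`, the fixed seed, the cap on each task's sample size, and the toy records are all hypothetical additions.

```python
import random
from collections import defaultdict


def sample_zs_opt(records, total_samples, seed=0):
    """Select zs_opt records, keep long outputs, then sample proportionally per task."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility

    # Step 1: keep only records with the zs_opt template type.
    zsopt = [r for r in records if r["template_type"] == "zs_opt"]
    if not zsopt:
        return []

    # Step 2: keep records whose output is at least as long as the average.
    average_length = sum(len(r["targets"]) for r in zsopt) / len(zsopt)
    filtered = [r for r in zsopt if len(r["targets"]) >= average_length]

    # Step 3: group by task and draw a share proportional to each task's size.
    by_task = defaultdict(list)
    for r in filtered:
        by_task[r["task_name"]].append(r)

    samples = []
    for group in by_task.values():
        size = round(len(group) / len(filtered) * total_samples)
        size = min(size, len(group))  # assumption: cap, since sample() raises if size > len(group)
        samples.extend(rng.sample(group, size))

    # Convert to the instruction/input/output format.
    return [
        {"input": "", "output": r["targets"], "instruction": r["inputs"]}
        for r in samples
    ]


# Toy records, invented for illustration only.
toy = [
    {"template_type": "zs_opt", "task_name": "cot_esnli",
     "inputs": f"q{i}", "targets": "x" * (i + 1)}
    for i in range(10)
] + [
    {"template_type": "fs_opt", "task_name": "cot_strategyqa",
     "inputs": "q", "targets": "y"}
]
result = sample_zs_opt(toy, total_samples=3)
```

Because each task's size is rounded independently, the total number of samples drawn can differ slightly from `total_samples` when there are many tasks.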



