
huangyt/FINETUNE2

Source: Hugging Face | Updated: 2023-09-01 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/huangyt/FINETUNE2
Resource summary:
This dataset combines several sub-datasets: FLAN_CoT(zs), Prm800k, ScienceQA, SciBench, ReClor, TheoremQA, OpenBookQA, ARB, and Openassistant-guanaco. Together they cover reasoning, mathematics, science QA, and commonsense, among other categories. The improvements behind the dataset include evaluation based on finetune1 scores, testing with small high-quality datasets, stratified sampling combined with CoT, and optimization of the dataset's output length. Each sample follows the instruction, input, output format, which tells the model which task to perform. The sampling algorithm selects zs_opt questions, keeps only the questions whose output length is at least the average, and then applies stratified sampling.
Provider:
huangyt
Original information summary

Dataset overview

Dataset list

| Dataset | Class | Number of Questions |
| --- | --- | --- |
| FLAN_CoT(zs) | Reasoning, MATH, ScienceQA, Commonsense | 8000 |
| Prm800k | Reasoning, MATH | 6713 |
| ScienceQA | ScienceQA | 5177 |
| SciBench | ScienceQA | 695 |
| ReClor | Reasoning | 1624 |
| TheoremQA | Commonsense, MATH, ScienceQA | 800 |
| OpenBookQA | Text_Understanding, Reasoning, Commonsense, ScienceQA | 5957 |
| ARB | Reasoning, MATH, ScienceQA, Commonsense, Text_Understanding | 605 |
| Openassistant-guanaco | Commonsense, Text_Understanding, Reasoning | 802 |

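Dataset format definition

The dataset uses the "instruction, input, output" format: each sample contains an instruction, an input, and an expected output. This format is commonly used to train models on specific tasks because it states explicitly what the model should do.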

```json
{
  "input": "",
  "output": "",
  "instruction": ""
}
```
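
To make the format concrete, here is a minimal sketch of how a record might be checked and turned into a prompt. The helper names `validate_record` and `build_prompt`, the prompt layout, and the example record are assumptions made for illustration; none of them are defined by the dataset.

```python
# Field names taken from the format definition above.
REQUIRED_KEYS = {"instruction", "input", "output"}


def validate_record(record):
    """Return True if the record has exactly the three expected string fields."""
    return set(record) == REQUIRED_KEYS and all(
        isinstance(record[key], str) for key in REQUIRED_KEYS
    )


def build_prompt(record):
    """Join instruction and (optional) input into one prompt string.

    The blank-line separator is a hypothetical layout, not the dataset's own.
    """
    if record["input"]:
        return f"{record['instruction']}\n\n{record['input']}"
    return record["instruction"]


# Made-up record, for illustration only.
example = {
    "instruction": "What is 2 + 3?",
    "input": "",
    "output": "5",
}

if validate_record(example):
    prompt = build_prompt(example)
```

An empty `input` is treated as "no extra context", which matches how the sampling code in this document fills the field with an empty string.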

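Sampling algorithm

The FLAN_V2 COT dataset contains many tasks, such as cot_esnli and cot_strategyqa. To make sure the dataset contains diverse, high-quality data, the zs_opt questions are selected first, then only the questions whose output length is at least the average length are kept, and finally stratified sampling is applied.

Sampling steps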

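  1. Select the zs_opt questions:

     ```python
     # `abc` is assumed to hold the loaded FLAN_V2 COT records (an iterable of dicts).
     zsopt_data = []
     for i in abc:
         if i["template_type"] == "zs_opt":
             zsopt_data.append(i)
     ```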

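  2. Filter by output length:

     ```python
     # Average output length, measured in characters of the "targets" field.
     output_lengths = [len(i["targets"]) for i in zsopt_data]
     average_length = sum(output_lengths) / len(output_lengths)

     # Keep only the records whose output is at least as long as the average.
     filtered_data = []
     for a in zsopt_data:
         if len(a["targets"]) >= average_length:
             filtered_data.append(a)
     ```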

  3. Stratified sampling:

     ```python
     import json
     import random

     # Count how many filtered records belong to each task.
     class_counts = {}
     for a in filtered_data:
         task_name = a["task_name"]
         class_counts[task_name] = class_counts.get(task_name, 0) + 1

     total_samples = 8000

     # Allocate samples to each task in proportion to its share of the filtered data.
     sample_sizes = {}
     for task_name, count in class_counts.items():
         sample_sizes[task_name] = round(count / len(filtered_data) * total_samples)

     # Draw the allocated number of records from each task without replacement.
     stratified_samples = {}
     for task_name, sample_size in sample_sizes.items():
         class_samples = [d for d in filtered_data if d["task_name"] == task_name]
         # Guard: random.sample raises ValueError if asked for more items than exist.
         sample_size = min(sample_size, len(class_samples))
         stratified_samples[task_name] = random.sample(class_samples, sample_size)

     # Convert to the instruction/input/output format and write the result to disk.
     final_samples = []
     for samples in stratified_samples.values():
         for sample in samples:
             final_samples.append(
                 {
                     "input": "",
                     "output": sample["targets"],
                     "instruction": sample["inputs"],
                 }
             )

     with open("cot_change.json", "w") as f:
         json.dump(final_samples, f, indent=2)
     ```
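
The three steps can also be tried end to end on a small in-memory example. The sketch below is an illustration under stated assumptions, not the author's pipeline: the function name `stratified_sample`, the `seed` parameter, and the toy records are hypothetical, and only the field names (`template_type`, `task_name`, `inputs`, `targets`) come from the steps above.

```python
import random
from collections import defaultdict


def stratified_sample(records, total_samples, seed=0):
    """Apply the three sampling steps to a list of FLAN-style dicts."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same draw

    # Step 1: keep only the zs_opt templates.
    zsopt = [r for r in records if r["template_type"] == "zs_opt"]
    if not zsopt:
        return []

    # Step 2: keep records whose output is at least as long as the average.
    average_length = sum(len(r["targets"]) for r in zsopt) / len(zsopt)
    kept = [r for r in zsopt if len(r["targets"]) >= average_length]

    # Step 3: sample each task in proportion to its share of the kept records.
    by_task = defaultdict(list)
    for r in kept:
        by_task[r["task_name"]].append(r)

    picked = []
    for group in by_task.values():
        size = round(len(group) / len(kept) * total_samples)
        size = min(size, len(group))  # never request more than the task has
        picked.extend(rng.sample(group, size))

    # Convert to the instruction/input/output format.
    return [
        {"instruction": r["inputs"], "input": "", "output": r["targets"]}
        for r in picked
    ]


# Toy records, made up for illustration: two tasks with outputs of length 1..10,
# plus one non-zs_opt record that step 1 should discard.
toy = [
    {"template_type": "zs_opt", "task_name": "cot_esnli",
     "inputs": f"q{i}", "targets": "x" * (i + 1)}
    for i in range(10)
] + [
    {"template_type": "zs_opt", "task_name": "cot_strategyqa",
     "inputs": f"s{i}", "targets": "y" * (i + 1)}
    for i in range(10)
] + [
    {"template_type": "fs_opt", "task_name": "cot_esnli",
     "inputs": "skip", "targets": "z" * 50}
]

result = stratified_sample(toy, total_samples=4)
```

Because each task's size is rounded independently, the per-task sizes do not always add up to exactly `total_samples`; on real data the final count can differ from the target by a few records.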
