KK04/LogicInference_OA

Name: KK04/LogicInference_OA
Creator: KK04
Published: 2023-04-05 15:38:22
License: Apache-2.0

Source: Hugging Face. Updated: 2023-04-05. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/KK04/LogicInference_OA

Resource description:

---
dataset_info:
  features:
  - name: INSTRUCTION
    dtype: string
  - name: RESPONSE
    dtype: string
  - name: SOURCE
    dtype: string
  splits:
  - name: train
    num_bytes: 30414202
    num_examples: 54607
  download_size: 7588805
  dataset_size: 30414202
license: apache-2.0
task_categories:
- question-answering
language:
- en
tags:
- Logic Inference
size_categories:
- 10K<n<100K
---

# Dataset Card for "LogicInference_OA"

This is a reproduction of the LogicInference dataset from the paper: https://openreview.net/pdf?id=HAGeIS_Lcg9.

The GitHub page of the LogicInference dataset: https://github.com/google-research/google-research/tree/master/logic_inference_dataset.

This dataset aims to provide more data for the Open Assistant project. Following that project's requirements, it has three columns: INSTRUCTION, RESPONSE, SOURCE.

The data differs slightly from what the original paper describes:

1. Of the three splits (IID/OOD/length), only IID is used. In the original paper, models appear to reach better performance with data generated by this split method.

2. The original paper has two forms of response: LOGICINFERENCEb (with the answer at the beginning) and LOGICINFERENCEe (with the answer at the end). This dataset uses LOGICINFERENCEe: for every question, the model first carries out the logical inference and then gives the final answer at the end.

3. In the original paper, some parameters in generate_dataset.py are:

   - N_INFERENCE_PROBLEMS = 5000
   - N_VARIATIONS = 25
   - N_EXAMPLES = 200000
   - TRAIN_RATIO = 0.9
   - LENGTH_SPLIT_THRESHOLD = 4
   - RANDOM_SEED = 0

   I chose some new parameters:

   - N_INFERENCE_PROBLEMS = 10000
   - N_VARIATIONS = 25
   - N_EXAMPLES = 55000
   - TRAIN_RATIO = 1
   - LENGTH_SPLIT_THRESHOLD = 4
   - RANDOM_SEED = 1111

The original script generated 4814 different inference problems and extended them to around 200,000 Q-A pairs. My settings generated 5491 different inference problems and extended them to 54,607 Instruction-Response pairs.

I think that for the Open Assistant project, the number of different inference problems probably matters more; generating many similar Instruction-Response pairs only adds training time and brings little benefit.
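To make the trade-off concrete, the short sketch below computes the average number of pairs generated per distinct inference problem under both settings. It uses only the counts quoted in the card; the dictionary and function names are illustrative and do not come from generate_dataset.py, and the original pair count is the approximate figure of 200,000.

```python
# Counts quoted in the dataset card; names here are illustrative only.
ORIGINAL = {"problems": 4814, "pairs": 200_000}    # pair count is approximate
THIS_DATASET = {"problems": 5491, "pairs": 54_607}


def pairs_per_problem(stats: dict) -> float:
    """Average number of generated pairs per distinct inference problem."""
    return stats["pairs"] / stats["problems"]


if __name__ == "__main__":
    for name, stats in (("original", ORIGINAL), ("this dataset", THIS_DATASET)):
        print(f"{name}: {pairs_per_problem(stats):.1f} pairs per problem")
```

Under these figures the new settings yield roughly a quarter as many pairs per problem while covering more distinct problems, which matches the stated preference for variety over volume.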

Provider:

KK04

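Summary of original information

Dataset overview

Basic information

Dataset name: LogicInference_OA
License: Apache-2.0
Language: English
Task category: question answering
Tags: Logic Inference
Size category: 10K<n<100K

Dataset structure

Features:
- INSTRUCTION: string
- RESPONSE: string
- SOURCE: string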
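A minimal sketch of how a record with these three string columns might be checked, and how the train split could be loaded with the Hugging Face `datasets` library. The helper names and the sample record are made up for illustration; `load_train_split` assumes the `datasets` package is installed and that the repository is reachable over the network.

```python
from typing import Mapping

# The three columns listed in the dataset card.
EXPECTED_COLUMNS = ("INSTRUCTION", "RESPONSE", "SOURCE")


def is_valid_record(record: Mapping[str, object]) -> bool:
    """Return True if the record has exactly the expected columns, all strings."""
    if set(record) != set(EXPECTED_COLUMNS):
        return False
    return all(isinstance(record[col], str) for col in EXPECTED_COLUMNS)


def load_train_split():
    """Load the train split (assumes `datasets` is installed and network access)."""
    from datasets import load_dataset  # imported lazily so the check above works offline

    return load_dataset("KK04/LogicInference_OA", split="train")


if __name__ == "__main__":
    # Hypothetical record, not taken from the dataset.
    sample = {"INSTRUCTION": "example question", "RESPONSE": "example answer", "SOURCE": "example"}
    print(is_valid_record(sample))
```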

Dataset splits

Training set:
- Number of examples: 54607
- Storage size: 30414202 bytes
- Download size: 7588805 bytes
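As a quick sanity check on these figures, the sketch below derives the average size of one example and the ratio of download size to stored size. The constants come from the listing above; the function names are illustrative.

```python
# Figures from the split summary above.
NUM_EXAMPLES = 54_607
DATASET_SIZE_BYTES = 30_414_202
DOWNLOAD_SIZE_BYTES = 7_588_805


def average_bytes_per_example(dataset_size: int, num_examples: int) -> float:
    """Mean stored size of a single example, in bytes."""
    return dataset_size / num_examples


def download_ratio(download_size: int, dataset_size: int) -> float:
    """Download size as a fraction of the stored dataset size."""
    return download_size / dataset_size


if __name__ == "__main__":
    avg = average_bytes_per_example(DATASET_SIZE_BYTES, NUM_EXAMPLES)
    ratio = download_ratio(DOWNLOAD_SIZE_BYTES, DATASET_SIZE_BYTES)
    print(f"average example size: {avg:.0f} bytes")
    print(f"download / stored size: {ratio:.2%}")
```

Each example averages a little under 560 bytes, and the download is about a quarter of the stored size.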

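Dataset generation parameters

N_INFERENCE_PROBLEMS: 10000
N_VARIATIONS: 25
N_EXAMPLES: 55000
TRAIN_RATIO: 1
LENGTH_SPLIT_THRESHOLD: 4
RANDOM_SEED: 1111

Dataset characteristics

Response format: for every question, the inference steps come first and the final answer is given at the end (LOGICINFERENCEe)
Dataset purpose: designed for the Open Assistant project, providing data according to its needs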
