KK04/LogicInference_OA
Hugging Face | Updated: 2023-04-05 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/KK04/LogicInference_OA
Resource description:
---
dataset_info:
  features:
  - name: INSTRUCTION
    dtype: string
  - name: RESPONSE
    dtype: string
  - name: SOURCE
    dtype: string
  splits:
  - name: train
    num_bytes: 30414202
    num_examples: 54607
  download_size: 7588805
  dataset_size: 30414202
license: apache-2.0
task_categories:
- question-answering
language:
- en
tags:
- Logic Inference
size_categories:
- 10K<n<100K
---
# Dataset Card for "LogicInference_OA"
This is a reproduction of the LogicInference dataset introduced in the paper: https://openreview.net/pdf?id=HAGeIS_Lcg9.
The GitHub page of the LogicInference dataset: https://github.com/google-research/google-research/tree/master/logic_inference_dataset.
This dataset is intended to provide more data for the Open Assistant project. Following the project's requirements, it has three columns: INSTRUCTION, RESPONSE, SOURCE.
The data in this dataset differ slightly from what was introduced in the original paper:
1. Of the three splits (IID/OOD/length), only IID is used. In the original paper, models appear to reach better performance with data generated by this split method.
2. In the original paper, there are two forms of responses: LOGICINFERENCE<sub>b</sub> (with the answer at the beginning) and LOGICINFERENCE<sub>e</sub> (with the answer at the end). This dataset uses LOGICINFERENCE<sub>e</sub>, which means that for every question the model first performs the logic inference and then gives the final answer at the end.
3. In the original paper, some parameters in generate_dataset.py are:
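As a minimal sketch of how the three columns might be consumed, the snippet below assumes the Hugging Face `datasets` library is installed and that the dataset is reachable under the repository id shown in the page title. The helpers `validate_row` and `load_train_split`, and the sample row with its placeholder values, are illustrative only and are not part of the dataset.

```python
EXPECTED_COLUMNS = ("INSTRUCTION", "RESPONSE", "SOURCE")


def validate_row(row):
    """Return True if the row has exactly the three string columns from the card."""
    return set(row) == set(EXPECTED_COLUMNS) and all(
        isinstance(row[name], str) for name in EXPECTED_COLUMNS
    )


def load_train_split():
    """Load the single train split (needs network access and `datasets`)."""
    # Imported lazily so validate_row can be used without the library installed.
    from datasets import load_dataset

    return load_dataset("KK04/LogicInference_OA", split="train")


# Hypothetical row used only to exercise validate_row; the values are placeholders,
# not real records from the dataset.
sample_row = {
    "INSTRUCTION": "placeholder question",
    "RESPONSE": "placeholder reasoning and answer",
    "SOURCE": "placeholder source",
}
is_valid = validate_row(sample_row)
```

Calling `load_train_split()` and passing its first row to `validate_row` would be one way to confirm the schema after download.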
N_INFERENCE_PROBLEMS = 5000
N_VARIATIONS = 25
N_EXAMPLES = 200000
TRAIN_RATIO = 0.9
LENGTH_SPLIT_THRESHOLD = 4
RANDOM_SEED = 0
I chose some new parameters:
N_INFERENCE_PROBLEMS = 10000
N_VARIATIONS = 25
N_EXAMPLES = 55000
TRAIN_RATIO = 1
LENGTH_SPLIT_THRESHOLD = 4
RANDOM_SEED = 1111
The original script generated 4814 different inference problems and expanded them into around 200,000 Q-A pairs. My settings generated 5491 different inference problems and expanded them into 54,607 Instruction-Response pairs. For the Open Assistant project, I think the number of distinct inference problems probably matters more; generating many similar Instruction-Response pairs would only add training time without much benefit.
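To make the trade-off concrete, the following sketch compares the two parameter sets and works out the average number of pairs per distinct inference problem. All numbers are copied from the text above; the dictionary and variable names are illustrative, and the figure for the original script is approximate because its pair count is given only as "around 200,000".

```python
# Parameter values as listed in the card; the dictionary names are illustrative.
ORIGINAL = {
    "N_INFERENCE_PROBLEMS": 5000,
    "N_VARIATIONS": 25,
    "N_EXAMPLES": 200000,
    "TRAIN_RATIO": 0.9,
    "LENGTH_SPLIT_THRESHOLD": 4,
    "RANDOM_SEED": 0,
}
THIS_DATASET = {
    "N_INFERENCE_PROBLEMS": 10000,
    "N_VARIATIONS": 25,
    "N_EXAMPLES": 55000,
    "TRAIN_RATIO": 1,
    "LENGTH_SPLIT_THRESHOLD": 4,
    "RANDOM_SEED": 1111,
}

# Parameters whose values differ, mapped to (original, new).
changed = {
    name: (ORIGINAL[name], THIS_DATASET[name])
    for name in ORIGINAL
    if ORIGINAL[name] != THIS_DATASET[name]
}

# Average pairs per distinct inference problem, from the counts reported above.
original_pairs_per_problem = 200_000 / 4814  # roughly 41.5
new_pairs_per_problem = 54_607 / 5491        # roughly 9.9

for name, (old, new) in changed.items():
    print(f"{name}: {old} -> {new}")
print(round(original_pairs_per_problem, 1), round(new_pairs_per_problem, 1))
```

Under these numbers, each distinct problem contributes about four times fewer near-duplicate pairs than in the original setup, which is consistent with the stated goal of favouring problem diversity over volume.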
Provider:
KK04
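Summary of original information
Dataset overview
Basic information
- Dataset name: LogicInference_OA
- License: Apache-2.0
- Language: English
- Task category: Question answering
- Tags: Logic Inference
- Size category: 10K<n<100K
Dataset structure
- Features:
  - INSTRUCTION: string
  - RESPONSE: string
  - SOURCE: string
Dataset splits
- Train:
  - Number of examples: 54607
  - Dataset size: 30414202 bytes
  - Download size: 7588805 bytes
Generation parameters
- N_INFERENCE_PROBLEMS: 10000
- N_VARIATIONS: 25
- N_EXAMPLES: 55000
- TRAIN_RATIO: 1
- LENGTH_SPLIT_THRESHOLD: 4
- RANDOM_SEED: 1111
Dataset characteristics
- Response format: for every question, the logic inference comes first and the final answer is given at the end (LOGICINFERENCE<sub>e</sub>)
- Intended use: built for the Open Assistant project, providing data according to its requirements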
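The difference between the two response layouts mentioned in the card (answer at the beginning versus answer at the end) can be sketched as follows. This is only an illustration of the ordering: `build_response` is a hypothetical helper, the placeholder strings stand in for real text, and the actual templates used by generate_dataset.py are not reproduced here.

```python
def build_response(inference_steps, answer, answer_first=False):
    """Join reasoning steps and the final answer in one of the two orders."""
    reasoning = " ".join(inference_steps)
    if answer_first:
        # Answer-at-beginning layout (the "b" variant in the paper).
        return f"{answer} {reasoning}"
    # Answer-at-end layout (the "e" variant, which this dataset uses).
    return f"{reasoning} {answer}"


# Placeholder content, not taken from the dataset.
steps = ["<step 1>", "<step 2>"]
answer = "<final answer>"

response_e = build_response(steps, answer)
response_b = build_response(steps, answer, answer_first=True)
```

With the answer placed last, a model trained on these responses produces its reasoning before committing to the final answer.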



