
KK04/LogicInference_OA

Source: Hugging Face. Updated: 2023-04-05. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/KK04/LogicInference_OA
Resource description:
---
dataset_info:
  features:
  - name: INSTRUCTION
    dtype: string
  - name: RESPONSE
    dtype: string
  - name: SOURCE
    dtype: string
  splits:
  - name: train
    num_bytes: 30414202
    num_examples: 54607
  download_size: 7588805
  dataset_size: 30414202
license: apache-2.0
task_categories:
- question-answering
language:
- en
tags:
- Logic Inference
size_categories:
- 10K<n<100K
---

# Dataset Card for "LogicInference_OA"

This is a reproduction of the LogicInference dataset from the paper: https://openreview.net/pdf?id=HAGeIS_Lcg9.

The GitHub page of the LogicInference dataset: https://github.com/google-research/google-research/tree/master/logic_inference_dataset.

This dataset aims to provide more data for the Open Assistant project. Following their requirements, it has three columns: INSTRUCTION, RESPONSE, SOURCE.

The data in this dataset differ slightly from what the original paper describes:

1. Of the three splits (IID/OOD/length), only IID is used. In the original paper, it seems that models reach better performance with data generated by this split method.

2. The original paper has two forms of responses: LOGICINFERENCE<sub>b</sub> (with the answer at the beginning) and LOGICINFERENCE<sub>e</sub> (with the answer at the end). This dataset uses LOGICINFERENCE<sub>e</sub>, which means that for every question the model first performs the logical inference and gives the final answer at the end.

3. In the original paper, some of the parameters in generate_dataset.py are:

   - N_INFERENCE_PROBLEMS = 5000
   - N_VARIATIONS = 25
   - N_EXAMPLES = 200000
   - TRAIN_RATIO = 0.9
   - LENGTH_SPLIT_THRESHOLD = 4
   - RANDOM_SEED = 0

   I chose some new parameters:

   - N_INFERENCE_PROBLEMS = 10000
   - N_VARIATIONS = 25
   - N_EXAMPLES = 55000
   - TRAIN_RATIO = 1
   - LENGTH_SPLIT_THRESHOLD = 4
   - RANDOM_SEED = 1111

The original script generated 4814 different inference problems and extended them to around 200,000 Q-A pairs. My settings generated 5491 different inference problems and extended them to 54,607 Instruction-Response pairs. I think that, for the Open Assistant project, the number of different inference problems may matter more: generating many similar Instruction-Response pairs only adds training time and does not make much sense.
Provider:
KK04
Summary of original information

Dataset overview

Basic information

  • Dataset name: LogicInference_OA
  • License: Apache-2.0
  • Language: English
  • Task category: question answering
  • Tags: Logic Inference
  • Size category: 10K<n<100K

Dataset structure

  • Features:
    • INSTRUCTION: string
    • RESPONSE: string
    • SOURCE: string
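
As a minimal sketch of what this three-column schema implies for a single row, the check below verifies that a record carries exactly the `INSTRUCTION`, `RESPONSE`, and `SOURCE` fields as strings. The helper name `is_valid_record` and the placeholder record are assumptions made for illustration; they are not part of the dataset or its tooling.

```python
from typing import Any, Mapping

# Column names taken from the feature list above; all three are strings.
EXPECTED_COLUMNS = ("INSTRUCTION", "RESPONSE", "SOURCE")


def is_valid_record(record: Mapping[str, Any]) -> bool:
    """Return True if the record has exactly the three string columns."""
    if set(record.keys()) != set(EXPECTED_COLUMNS):
        return False
    return all(isinstance(record[name], str) for name in EXPECTED_COLUMNS)


# Placeholder record for illustration only; NOT taken from the dataset.
sample = {
    "INSTRUCTION": "<question text>",
    "RESPONSE": "<inference steps, then the final answer>",
    "SOURCE": "<source label>",
}

print(is_valid_record(sample))
print(is_valid_record({"INSTRUCTION": "missing the other two columns"}))
```

A check like this can be run over rows after loading the data with whatever tool is in use, to catch missing columns or non-string values early.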

Dataset splits

  • Training set:
    • Number of examples: 54607
    • Storage size: 30414202 bytes
    • Download size: 7588805 bytes
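
The figures above can be turned into two rough indicators: the average stored size of one example and the ratio of download size to stored size. The short calculation below is only a sketch that uses the three numbers listed for the training split; the variable names are chosen here for illustration.

```python
# Figures copied from the training-split listing above.
NUM_EXAMPLES = 54607
DATASET_SIZE_BYTES = 30414202
DOWNLOAD_SIZE_BYTES = 7588805

# Average stored size of a single INSTRUCTION/RESPONSE/SOURCE row.
avg_bytes_per_example = DATASET_SIZE_BYTES / NUM_EXAMPLES

# Fraction of the stored size that actually has to be downloaded.
download_ratio = DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES

print(f"average size per example: {avg_bytes_per_example:.0f} bytes")
print(f"download / stored size:   {download_ratio:.1%}")
```

Each example therefore averages a little under 560 bytes, and the download is roughly a quarter of the stored size.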

Dataset generation parameters

  • N_INFERENCE_PROBLEMS: 10000
  • N_VARIATIONS: 25
  • N_EXAMPLES: 55000
  • TRAIN_RATIO: 1
  • LENGTH_SPLIT_THRESHOLD: 4
  • RANDOM_SEED: 1111
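
To make the difference from the original paper's settings concrete, the sketch below places both parameter sets side by side and computes how many examples each distinct inference problem yields on average. The `GenerationConfig` container and the `changed_fields` helper are hypothetical and used only for this comparison; the parameter values and the problem and pair counts (4814 problems and around 200,000 pairs originally, 5491 problems and 54,607 pairs here) are the ones reported in the dataset card.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationConfig:
    """Hypothetical container for the parameters named in the dataset card."""
    n_inference_problems: int
    n_variations: int
    n_examples: int
    train_ratio: float
    length_split_threshold: int
    random_seed: int


# Values as reported in the dataset card.
ORIGINAL = GenerationConfig(5000, 25, 200000, 0.9, 4, 0)
THIS_DATASET = GenerationConfig(10000, 25, 55000, 1, 4, 1111)


def changed_fields(a: GenerationConfig, b: GenerationConfig) -> dict:
    """Map each parameter that differs to its (a, b) pair of values."""
    a_dict, b_dict = asdict(a), asdict(b)
    return {k: (a_dict[k], b_dict[k]) for k in a_dict if a_dict[k] != b_dict[k]}


changed = changed_fields(ORIGINAL, THIS_DATASET)
for name, (old, new) in changed.items():
    print(f"{name}: {old} -> {new}")

# Average examples per distinct inference problem, from the reported counts.
# The original pair count is approximate ("around 200,000").
orig_pairs_per_problem = 200000 / 4814
new_pairs_per_problem = 54607 / 5491
print(f"original: ~{orig_pairs_per_problem:.1f} pairs per problem")
print(f"this dataset: ~{new_pairs_per_problem:.1f} pairs per problem")
```

Under these reported counts, the new settings trade roughly 40 examples per problem for about 10, which is consistent with the card's stated preference for more distinct problems over many similar pairs.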

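Dataset characteristics

  • Response format: for every question, the logical inference comes first and the final answer is given at the end (LOGICINFERENCE<sub>e</sub>)
  • Purpose: designed for the Open Assistant project, providing data according to its needs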