fbougares/simple_questions_v2

Name: fbougares/simple_questions_v2
Creator: fbougares
Published: 2024-01-18 11:15:54
License: No description provided

Hugging Face. Updated: 2024-01-18. Indexed: 2024-05-25.

Download link:

https://hf-mirror.com/datasets/fbougares/simple_questions_v2

Resource description:

---
annotations_creators:
- machine-generated
language_creators:
- found
language:
- en
license:
- cc-by-3.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- open-domain-qa
paperswithcode_id: simplequestions
pretty_name: SimpleQuestions
dataset_info:
- config_name: annotated
  features:
  - name: id
    dtype: string
  - name: subject_entity
    dtype: string
  - name: relationship
    dtype: string
  - name: object_entity
    dtype: string
  - name: question
    dtype: string
  splits:
  - name: train
    num_bytes: 12376039
    num_examples: 75910
  - name: validation
    num_bytes: 12376039
    num_examples: 75910
  - name: test
    num_bytes: 12376039
    num_examples: 75910
  download_size: 423435590
  dataset_size: 37128117
- config_name: freebase2m
  features:
  - name: id
    dtype: string
  - name: subject_entity
    dtype: string
  - name: relationship
    dtype: string
  - name: object_entities
    sequence: string
  splits:
  - name: train
    num_bytes: 1964037256
    num_examples: 10843106
  download_size: 423435590
  dataset_size: 1964037256
- config_name: freebase5m
  features:
  - name: id
    dtype: string
  - name: subject_entity
    dtype: string
  - name: relationship
    dtype: string
  - name: object_entities
    sequence: string
  splits:
  - name: train
    num_bytes: 2481753516
    num_examples: 12010500
  download_size: 423435590
  dataset_size: 2481753516
---

# Dataset Card for SimpleQuestions

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://research.fb.com/downloads/babi/
- **Repository:** https://github.com/fbougares/TSAC
- **Paper:** https://research.fb.com/publications/large-scale-simple-question-answering-with-memory-networks/
- **Leaderboard:** [If the dataset supports an active leaderboard, add link here]()
- **Point of Contact:** [Antoine Bordes](abordes@fb.com), [Nicolas Usunier](usunier@fb.com), [Sumit Chopra](spchopra@fb.com), [Jason Weston](jase@fb.com)

### Dataset Summary

[More Information Needed]

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

Here are some examples of questions and facts:

* What American cartoonist is the creator of Andy Lippincott?
  Fact: (andy_lippincott, character_created_by, garry_trudeau)
* Which forest is Fires Creek in?
  Fact: (fires_creek, containedby, nantahala_national_forest)
* What does Jimmy Neutron do?
  Fact: (jimmy_neutron, fictional_character_occupation, inventor)
* What dietary restriction is incompatible with kimchi?
  Fact: (kimchi, incompatible_with_dietary_restrictions, veganism)

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

Thanks to [@abhishekkrthakur](https://github.com/abhishekkrthakur) for adding this dataset.

Provider:

fbougares

Summary of original information

Dataset card: SimpleQuestions

Dataset description

Dataset summary

annotations_creators: machine-generated
language_creators: found
language: en
license: cc-by-3.0
multilinguality: monolingual
size_categories: 100K<n<1M
source_datasets: original
task_categories: question-answering
task_ids: open-domain-qa
paperswithcode_id: simplequestions
pretty_name: SimpleQuestions

Dataset structure

Data instances

config_name: annotated
- features:
  - id: string
  - subject_entity: string
  - relationship: string
  - object_entity: string
  - question: string
- splits:
  - train:
    - num_bytes: 12376039
    - num_examples: 75910
  - validation:
    - num_bytes: 12376039
    - num_examples: 75910
  - test:
    - num_bytes: 12376039
    - num_examples: 75910
- download_size: 423435590
- dataset_size: 37128117
config_name: freebase2m
- features:
  - id: string
  - subject_entity: string
  - relationship: string
  - object_entities: sequence: string
- splits:
  - train:
    - num_bytes: 1964037256
    - num_examples: 10843106
- download_size: 423435590
- dataset_size: 1964037256
config_name: freebase5m
- features:
  - id: string
  - subject_entity: string
  - relationship: string
  - object_entities: sequence: string
- splits:
  - train:
    - num_bytes: 2481753516
    - num_examples: 12010500
- download_size: 423435590
- dataset_size: 2481753516
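
The byte counts above can be cross-checked with a little arithmetic: for each configuration, `dataset_size` should equal the sum of the `num_bytes` values of its splits. A minimal sketch follows, with the numbers copied from the listing above; the dictionary layout and the `sizes_consistent` helper are illustrative assumptions, not part of any library.

```python
# Split sizes (num_bytes) and dataset_size copied from the card metadata above.
CARD = {
    "annotated": {
        "splits": {"train": 12376039, "validation": 12376039, "test": 12376039},
        "dataset_size": 37128117,
    },
    "freebase2m": {"splits": {"train": 1964037256}, "dataset_size": 1964037256},
    "freebase5m": {"splits": {"train": 2481753516}, "dataset_size": 2481753516},
}


def sizes_consistent(card: dict) -> dict:
    """Return, per config, whether dataset_size equals the sum of split sizes."""
    return {
        name: sum(info["splits"].values()) == info["dataset_size"]
        for name, info in card.items()
    }


if __name__ == "__main__":
    for name, ok in sizes_consistent(CARD).items():
        print(f"{name}: {'consistent' if ok else 'mismatch'}")
```

The three `annotated` splits are listed with identical byte and example counts. The check only confirms that the listed figures are internally consistent, not that they match the actual files.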

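Collected summary

Dataset introduction

Construction

According to the card metadata, the annotations in SimpleQuestions are machine-generated, and the dataset is organized into three configurations: annotated, freebase2m, and freebase5m. All three share the id, subject_entity, and relationship fields. The annotated configuration adds object_entity and question and is divided into train, validation, and test splits; freebase2m and freebase5m store a list of object_entities per record and provide only a train split.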

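Characteristics

The dataset focuses on open-domain question answering. It contains a large number of simple questions, each paired with a fact triple, and covers a wide range of subject and object entities. Its multi-configuration design also lets researchers experiment at different data scales to suit different research needs.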
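
To make the question-to-triple pairing concrete, here is a minimal sketch that parses the `(subject, relationship, object)` notation used in the card's examples into a small record. The `Fact` class and `parse_fact` helper are hypothetical names introduced for illustration; per the card metadata, the dataset itself stores these values in separate `subject_entity`, `relationship`, and `object_entity` fields.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """A (subject, relationship, object) triple; hypothetical helper type."""

    subject_entity: str
    relationship: str
    object_entity: str


def parse_fact(text: str) -> Fact:
    """Parse a string such as '(kimchi, some_relation, veganism)' into a Fact."""
    inner = text.strip()
    if not (inner.startswith("(") and inner.endswith(")")):
        raise ValueError(f"expected a parenthesised triple, got: {text!r}")
    parts = [part.strip() for part in inner[1:-1].split(",")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected exactly three non-empty fields, got: {text!r}")
    return Fact(*parts)


# Question/fact pair taken from the dataset card's examples.
question = "Which forest is Fires Creek in?"
fact = parse_fact("(fires_creek, containedby, nantahala_national_forest)")
print(question, "->", fact.object_entity)
```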

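Usage

Researchers can choose the configuration that suits their research goal and task. The dataset can be downloaded and loaded with the Hugging Face libraries. Users must follow the terms of the Creative Commons BY 3.0 license, including proper attribution when using and citing the dataset.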
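
A minimal sketch of that workflow, assuming the Hugging Face `datasets` library is installed and that the repository loads with the configuration and split names listed in the card metadata. This has not been verified here, and script-based repositories may need extra arguments or may not load on recent library versions. The `CONFIG_SPLITS` table and `load_simple_questions` wrapper are hypothetical helpers.

```python
# Configurations and splits as listed in the card metadata.
CONFIG_SPLITS = {
    "annotated": ("train", "validation", "test"),
    "freebase2m": ("train",),
    "freebase5m": ("train",),
}


def load_simple_questions(config: str = "annotated", split: str = "train"):
    """Load one split, rejecting config/split pairs the card does not list."""
    if config not in CONFIG_SPLITS:
        raise ValueError(
            f"unknown config {config!r}; expected one of {sorted(CONFIG_SPLITS)}"
        )
    if split not in CONFIG_SPLITS[config]:
        raise ValueError(f"config {config!r} has no split {split!r}")
    # Imported lazily so the validation above works without the library.
    from datasets import load_dataset

    return load_dataset("fbougares/simple_questions_v2", config, split=split)


if __name__ == "__main__":
    train = load_simple_questions("annotated", "train")
    # Per the card, records should carry id, subject_entity, relationship,
    # object_entity, and question.
    print(train[0])
```

Validating the configuration and split before calling the library gives a clearer error than a failed download, which matters here because the Freebase subsets are several gigabytes according to the card.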

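Background and challenges

Background overview

SimpleQuestions comes from machine learning and natural language processing research and is intended to advance open-domain question answering. It was created in 2015 by the Facebook AI team; the main researchers include Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Its core research problem is building systems that can understand and answer simple factual questions, a task relevant to improving machine understanding of natural language. According to the card metadata, the question set falls in the 100K to 1M size category (the annotated train split lists 75,910 examples), while the accompanying Freebase subsets contain millions of fact records (10,843,106 and 12,010,500). The questions span a wide range of knowledge domains, giving researchers and developers a resource for training and testing question-answering models.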

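Current challenges

During construction, the challenges included extracting useful information from a large volume of raw data and ensuring that questions and answers are aligned correctly. The diversity and scale of the dataset also test how well models generalize. As a research problem, SimpleQuestions pushes against the limits of traditional question-answering systems: a model must not only understand the question but also extract and combine information from the relations in a large knowledge base. Possible biases and limitations in the data also call for careful discussion and evaluation by anyone using it.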

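Common scenarios

Classic use cases

In natural language processing, fbougares/simple_questions_v2 is commonly used for open-domain question answering research. It provides a large number of simple questions together with their corresponding entity-relation facts, which makes it suitable for training and evaluating models on understanding natural-language questions and retrieving facts.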
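
As a toy illustration of the fact-retrieval step, the sketch below indexes a few triples from the card's examples by `(subject_entity, relationship)` and looks up the objects for a pair that is assumed to have been predicted already. Mapping a natural-language question to that pair is the hard part and is not shown; `build_index` and `answer` are hypothetical names.

```python
from collections import defaultdict

# Triples copied from the examples in the dataset card.
TRIPLES = [
    ("andy_lippincott", "character_created_by", "garry_trudeau"),
    ("fires_creek", "containedby", "nantahala_national_forest"),
    ("jimmy_neutron", "fictional_character_occupation", "inventor"),
    ("kimchi", "incompatible_with_dietary_restrictions", "veganism"),
]


def build_index(triples):
    """Map (subject, relationship) to a list of objects."""
    index = defaultdict(list)
    for subject, relationship, obj in triples:
        index[(subject, relationship)].append(obj)
    return dict(index)


def answer(index, subject, relationship):
    """Return all objects for the pair, or an empty list if it is not indexed."""
    return index.get((subject, relationship), [])


index = build_index(TRIPLES)
# "What does Jimmy Neutron do?" with the (subject, relationship) pair assumed given.
print(answer(index, "jimmy_neutron", "fictional_character_occupation"))
```

Returning a list rather than a single value mirrors the `object_entities` field of the Freebase configurations, where one subject and relationship may have several objects.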

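Academic problems addressed

fbougares/simple_questions_v2 addresses the research problem of accurately understanding question intent and quickly retrieving the relevant facts in open-domain question answering. By providing structured data, it helps researchers develop systems that can handle the variety of questions found in the real world, and it has research value for improving machine understanding of natural language.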

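Derived related work

The dataset has led to a large body of related work, including but not limited to complex question answering built on simple question answering, research on multilingual question-answering systems, and improvements to techniques such as memory networks and knowledge graphs that make use of the dataset.

The content above was collected and summarized by 遇见数据集.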
