Marbyun/internal-datasets
收藏Hugging Face2023-06-07 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/Marbyun/internal-datasets
下载链接
链接失效反馈官方服务:
资源简介:
---
annotations_creators:
- generated
language_creators:
- found
language:
- en
license: mit
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- extractive-qa
- open-domain-qa
pretty_name: synQA
---
# Dataset Card for synQA
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Homepage:** [Internal-Datasets homepage](https://github.com/Marbyun/datasets-huggingface)
- **Point of Contact:** [Marbyun](https://huggingface.co/Marbyun)
### Dataset Summary
This Datasets purpose for AI Question-Answering'Datasets. This Dataset inspired by SynQA And SQuAD v1.1 (https://arxiv.org/abs/1606.05250) training set.
### Languages
The text in the dataset is in English. The associated BCP-47 code is `en`.
## Dataset Structure
### Data Instances
Data is provided in the same format as SQuAD 1.1. An example is shown below:
```
{
"data": [
{
"title": "None",
"paragraphs": [
{
"context": "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",
"qas": [
{
"id": "689f275aacba6c43ff112b2c7cb16129bfa934fa",
"question": "What material is the statue of Christ made of?",
"answers": [
{
"answer_start": 190,
"text": "organic copper"
}
]
},
{
"id": "73bd3f52f5934e02332787898f6e568d04bc5403",
"question": "Who is on the Main Building's gold dome?",
"answers": [
{
"answer_start": 111,
"text": "the Virgin Mary."
}
]
},
{
"id": "4d459d5b75fd8a6623446290c542f99f1538cf84",
"question": "What kind of statue is at the end of the main drive?",
"answers": [
{
"answer_start": 667,
"text": "modern stone"
}
]
},
{
"id": "987a1e469c5b360f142b0a171e15cef17cd68ea6",
"question": "What type of dome is on the Main Building at Notre Dame?",
"answers": [
{
"answer_start": 79,
"text": "gold"
}
]
}
]
}
]
}
]
}
```
### Data Fields
- title: all "None" in this dataset
- context: the context/passage
- id: a string identifier for each question
- answers: a list of all provided answers (one per question in our case, but multiple may exist in SQuAD) with an `answer_start` field which is the character index of the start of the answer span, and a `text` field which is the answer text.
### Data Splits
The dataset is composed of a single split of 314,811 examples that we used in a two-stage fine-tuning process (refer to the paper for further details).
## Dataset Creation
### Curation Rationale
This dataset was created to investigate the effects of using synthetic adversarial data generation to improve robustness of state-of-the-art QA models.
### Source Data
#### Initial Data Collection and Normalization
The source passages are from Wikipedia and are the same as those used in [SQuAD v1.1](https://arxiv.org/abs/1606.05250).
#### Who are the source language producers?
The source language produces are Wikipedia editors for the passages, and a BART-Large generative model for the questions.
### Personal and Sensitive Information
No annotator identifying details are provided.
## Considerations for Using the Data
### Social Impact of Dataset
The purpose of this dataset is to help develop better question answering systems.
A system that succeeds at the supported task would be able to provide an accurate extractive answer from a short passage. This dataset is to be seen as a support resource for improve the ability of systems t handle questions that contemporary state-of-the-art models struggle to answer correctly, thus often requiring more complex comprehension abilities than say detecting phrases explicitly mentioned in the passage with high overlap to the question.
It should be noted, however, that the the source passages are both domain-restricted and linguistically specific, and that provided questions and answers do not constitute any particular social application.
### Discussion of Biases
The dataset may exhibit various biases in terms of the source passage selection, selected candidate answers, generated questions, quality re-labelling process, as well as any algorithmic biases that may be exacerbated from the adversarial annotation process used to collect the SQuAD and AdversarialQA data on which the generators were trained.
### Other Known Limitations
N/a
## Additional Information
### Dataset Curators
This Dataset prepared by RnD Team.
### Licensing Information
This dataset is distributed under the [MIT License](https://opensource.org/licenses/MIT).
### Citation Information
```
@inproceedings{Rnd-AI-Team,
title = "Dataset for Develop AI.",
author = "RnD Team,",
booktitle = "",
month = jun,
year = "2023",
address = "",
publisher = "",
url = "",
doi = "",
pages = "",
abstract = "This Dataset prepare by RnD Team for develop AI Question and Answering Chatbot.",
}
```
提供机构:
Marbyun
原始信息汇总
数据集概述
数据集名称
- 名称: synQA
数据集属性
- 语言: 英语 (
en) - 许可证: MIT
- 多语言性: 单语种
- 大小: 1K<n<10K
- 源数据: 原始数据
- 任务类别: 问答
- 任务ID: 抽取式问答 (
extractive-qa), 开放领域问答 (open-domain-qa)
数据集结构
- 数据实例: 遵循SQuAD 1.1格式,包含标题、段落、问题和答案。
- 数据字段:
- title: 固定为"None"
- context: 文本段落
- id: 问题标识符
- answers: 包含答案开始位置和文本的列表
- 数据分割: 单一分割,共314,811个实例
数据集创建
- 采集理由: 用于研究合成对抗数据生成对提高最先进QA模型鲁棒性的影响。
- 源数据: 来自Wikipedia的段落,问题由BART-Large生成模型产生。
使用考虑
- 社会影响: 旨在帮助开发更好的问答系统。
- 偏见讨论: 可能存在源段落选择、答案选择、问题生成等方面的偏见。
附加信息
- 数据集准备者: RnD团队
- 许可证信息: MIT License



