Marbyun/internal-datasets

Name: Marbyun/internal-datasets
Creator: Marbyun
Published: 2023-06-07 08:02:08
License: No description available

Hugging Face | Updated 2023-06-07 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/Marbyun/internal-datasets

Resource description:

---
annotations_creators:
- generated
language_creators:
- found
language:
- en
license: mit
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- extractive-qa
- open-domain-qa
pretty_name: synQA
---

# Dataset Card for synQA

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Internal-Datasets homepage](https://github.com/Marbyun/datasets-huggingface)
- **Point of Contact:** [Marbyun](https://huggingface.co/Marbyun)

### Dataset Summary

This dataset is intended for training AI question-answering systems. It is inspired by SynQA and the SQuAD v1.1 (https://arxiv.org/abs/1606.05250) training set.

### Languages

The text in the dataset is in English. The associated BCP-47 code is `en`.

## Dataset Structure

### Data Instances

Data is provided in the same format as SQuAD 1.1. An example is shown below:

```
{
  "data": [
    {
      "title": "None",
      "paragraphs": [
        {
          "context": "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",
          "qas": [
            {
              "id": "689f275aacba6c43ff112b2c7cb16129bfa934fa",
              "question": "What material is the statue of Christ made of?",
              "answers": [
                {
                  "answer_start": 190,
                  "text": "organic copper"
                }
              ]
            },
            {
              "id": "73bd3f52f5934e02332787898f6e568d04bc5403",
              "question": "Who is on the Main Building's gold dome?",
              "answers": [
                {
                  "answer_start": 111,
                  "text": "the Virgin Mary."
                }
              ]
            },
            {
              "id": "4d459d5b75fd8a6623446290c542f99f1538cf84",
              "question": "What kind of statue is at the end of the main drive?",
              "answers": [
                {
                  "answer_start": 667,
                  "text": "modern stone"
                }
              ]
            },
            {
              "id": "987a1e469c5b360f142b0a171e15cef17cd68ea6",
              "question": "What type of dome is on the Main Building at Notre Dame?",
              "answers": [
                {
                  "answer_start": 79,
                  "text": "gold"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
```

### Data Fields

- title: all "None" in this dataset
- context: the context/passage
- id: a string identifier for each question
- question: the question text
- answers: a list of all provided answers (one per question in our case, but multiple may exist in SQuAD), each with an `answer_start` field, which is the character index of the start of the answer span, and a `text` field, which is the answer text.

### Data Splits

The dataset is composed of a single split of 314,811 examples that we used in a two-stage fine-tuning process (refer to the paper for further details).
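Because `answer_start` is a character offset into `context`, a quick consistency check is to slice the context at that offset and compare the slice with `text`. The following is a minimal sketch of such a check, assuming the nested SQuAD 1.1 layout described above. The helper names `iter_qas` and `find_misaligned_answers` and the toy record are hypothetical and are not part of this dataset or of any library.

```python
def iter_qas(squad_dict):
    """Yield (context, qa) pairs from a SQuAD 1.1-style dictionary."""
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield context, qa


def find_misaligned_answers(squad_dict):
    """Return (question id, answer text) pairs whose span does not match the context."""
    misaligned = []
    for context, qa in iter_qas(squad_dict):
        for answer in qa["answers"]:
            start = answer["answer_start"]
            end = start + len(answer["text"])
            # The answer is extractive, so the slice should equal the answer text.
            if context[start:end] != answer["text"]:
                misaligned.append((qa["id"], answer["text"]))
    return misaligned


# Hypothetical toy record in the same layout (not taken from the dataset).
context = "Next to the Main Building is the Basilica of the Sacred Heart."
sample = {
    "data": [{
        "title": "None",
        "paragraphs": [{
            "context": context,
            "qas": [
                {"id": "q1",
                 "question": "What is next to the Main Building?",
                 "answers": [{"answer_start": context.find("the Basilica"),
                              "text": "the Basilica of the Sacred Heart"}]},
                # Deliberately wrong: "Grotto" does not occur at offset 0.
                {"id": "q2",
                 "question": "What is behind the basilica?",
                 "answers": [{"answer_start": 0, "text": "Grotto"}]},
            ],
        }],
    }]
}

print(find_misaligned_answers(sample))
```

Running a check like this over a loaded file is a cheap way to spot records whose answer text does not literally occur at the stated offset before they reach a training pipeline.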
## Dataset Creation

### Curation Rationale

This dataset was created to investigate the effects of using synthetic adversarial data generation to improve robustness of state-of-the-art QA models.

### Source Data

#### Initial Data Collection and Normalization

The source passages are from Wikipedia and are the same as those used in [SQuAD v1.1](https://arxiv.org/abs/1606.05250).

#### Who are the source language producers?

The source language producers are Wikipedia editors for the passages, and a BART-Large generative model for the questions.

### Personal and Sensitive Information

No annotator identifying details are provided.

## Considerations for Using the Data

### Social Impact of Dataset

The purpose of this dataset is to help develop better question answering systems.

A system that succeeds at the supported task would be able to provide an accurate extractive answer from a short passage. This dataset is to be seen as a support resource for improving the ability of systems to handle questions that contemporary state-of-the-art models struggle to answer correctly, thus often requiring more complex comprehension abilities than, say, detecting phrases explicitly mentioned in the passage with high overlap to the question.

It should be noted, however, that the source passages are both domain-restricted and linguistically specific, and that the provided questions and answers do not constitute any particular social application.

### Discussion of Biases

The dataset may exhibit various biases in terms of the source passage selection, selected candidate answers, generated questions, quality re-labelling process, as well as any algorithmic biases that may be exacerbated from the adversarial annotation process used to collect the SQuAD and AdversarialQA data on which the generators were trained.

### Other Known Limitations

N/A

## Additional Information

### Dataset Curators

This dataset was prepared by the RnD Team.
### Licensing Information

This dataset is distributed under the [MIT License](https://opensource.org/licenses/MIT).

### Citation Information

```
@inproceedings{Rnd-AI-Team,
    title = "Dataset for Develop AI.",
    author = "RnD Team,",
    booktitle = "",
    month = jun,
    year = "2023",
    address = "",
    publisher = "",
    url = "",
    doi = "",
    pages = "",
    abstract = "This Dataset prepare by RnD Team for develop AI Question and Answering Chatbot.",
}
```

Provider:

Marbyun

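Original information summary

Dataset overview

Dataset name

Name: synQA

Dataset attributes

Language: English (en)
License: MIT
Multilinguality: monolingual
Size: 1K<n<10K
Source data: original
Task category: question answering
Task IDs: extractive QA (extractive-qa), open-domain QA (open-domain-qa)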

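Dataset structure

Data instances: follow the SQuAD 1.1 format, comprising titles, paragraphs, questions, and answers.
Data fields:
- title: always "None"
- context: the text passage
- id: question identifier
- answers: a list containing each answer's start position and text
Data splits: a single split with 314,811 examples in total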
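Many extractive-QA training scripts work with one flat record per question rather than the nested article/paragraph layout. The flat layout used by the Hugging Face `squad` dataset, for instance, stores `answers` as parallel `text` and `answer_start` lists. The sketch below shows one way to flatten the nested structure into that shape. The function name `flatten_squad` and the toy record are hypothetical, and nothing here is confirmed as the curators' own preprocessing.

```python
def flatten_squad(squad_dict):
    """Turn a nested SQuAD 1.1-style dictionary into one flat row per question."""
    rows = []
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                rows.append({
                    "id": qa["id"],
                    "title": article["title"],
                    "context": paragraph["context"],
                    "question": qa["question"],
                    # Parallel lists: answers["text"][i] starts at answers["answer_start"][i].
                    "answers": {
                        "text": [a["text"] for a in qa["answers"]],
                        "answer_start": [a["answer_start"] for a in qa["answers"]],
                    },
                })
    return rows


# Hypothetical toy record (not taken from the dataset).
toy = {
    "data": [{
        "title": "None",
        "paragraphs": [{
            "context": "The Grotto is a Marian place of prayer and reflection.",
            "qas": [
                {"id": "q1",
                 "question": "What is the Grotto?",
                 "answers": [{"answer_start": 14,
                              "text": "a Marian place of prayer and reflection"}]},
                {"id": "q2",
                 "question": "What kind of place is the Grotto?",
                 "answers": [{"answer_start": 16, "text": "Marian"}]},
            ],
        }],
    }]
}

rows = flatten_squad(toy)
for row in rows:
    print(row["id"], row["question"], row["answers"])
```

Each row repeats its paragraph's context, which costs memory but keeps every example self-contained for tokenization and batching.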

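Dataset creation

Curation rationale: to study the effect of synthetic adversarial data generation on improving the robustness of state-of-the-art QA models.
Source data: passages from Wikipedia; questions generated by a BART-Large generative model.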

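Considerations for use

Social impact: intended to help develop better question answering systems.
Bias discussion: biases may exist in source passage selection, answer selection, question generation, and related steps.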

Additional information

Dataset curators: RnD Team
Licensing information: MIT License
