nikhilweee/sharc_modified

Name: nikhilweee/sharc_modified
Creator: nikhilweee
Published: 2024-01-18 11:15:51
License: No description available

Hugging Face · Updated 2024-01-18 · Indexed 2024-05-25

Download link:

https://hf-mirror.com/datasets/nikhilweee/sharc_modified

Resource description:

---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
- expert-generated
language:
- en
license:
- unknown
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- extended|sharc
task_categories:
- question-answering
task_ids:
- extractive-qa
paperswithcode_id: null
pretty_name: SharcModified
tags:
- conversational-qa
dataset_info:
- config_name: mod
  features:
  - name: id
    dtype: string
  - name: utterance_id
    dtype: string
  - name: source_url
    dtype: string
  - name: snippet
    dtype: string
  - name: question
    dtype: string
  - name: scenario
    dtype: string
  - name: history
    list:
    - name: follow_up_question
      dtype: string
    - name: follow_up_answer
      dtype: string
  - name: evidence
    list:
    - name: follow_up_question
      dtype: string
    - name: follow_up_answer
      dtype: string
  - name: answer
    dtype: string
  splits:
  - name: train
    num_bytes: 15138034
    num_examples: 21890
  - name: validation
    num_bytes: 1474239
    num_examples: 2270
  download_size: 21197271
  dataset_size: 16612273
- config_name: mod_dev_multi
  features:
  - name: id
    dtype: string
  - name: utterance_id
    dtype: string
  - name: source_url
    dtype: string
  - name: snippet
    dtype: string
  - name: question
    dtype: string
  - name: scenario
    dtype: string
  - name: history
    list:
    - name: follow_up_question
      dtype: string
    - name: follow_up_answer
      dtype: string
  - name: evidence
    list:
    - name: follow_up_question
      dtype: string
    - name: follow_up_answer
      dtype: string
  - name: answer
    dtype: string
  - name: all_answers
    sequence: string
  splits:
  - name: validation
    num_bytes: 1553940
    num_examples: 2270
  download_size: 2006124
  dataset_size: 1553940
- config_name: history
  features:
  - name: id
    dtype: string
  - name: utterance_id
    dtype: string
  - name: source_url
    dtype: string
  - name: snippet
    dtype: string
  - name: question
    dtype: string
  - name: scenario
    dtype: string
  - name: history
    list:
    - name: follow_up_question
      dtype: string
    - name: follow_up_answer
      dtype: string
  - name: evidence
    list:
    - name: follow_up_question
      dtype: string
    - name: follow_up_answer
      dtype: string
  - name: answer
    dtype: string
  splits:
  - name: train
    num_bytes: 15083103
    num_examples: 21890
  - name: validation
    num_bytes: 1468604
    num_examples: 2270
  download_size: 21136658
  dataset_size: 16551707
- config_name: history_dev_multi
  features:
  - name: id
    dtype: string
  - name: utterance_id
    dtype: string
  - name: source_url
    dtype: string
  - name: snippet
    dtype: string
  - name: question
    dtype: string
  - name: scenario
    dtype: string
  - name: history
    list:
    - name: follow_up_question
      dtype: string
    - name: follow_up_answer
      dtype: string
  - name: evidence
    list:
    - name: follow_up_question
      dtype: string
    - name: follow_up_answer
      dtype: string
  - name: answer
    dtype: string
  - name: all_answers
    sequence: string
  splits:
  - name: validation
    num_bytes: 1548305
    num_examples: 2270
  download_size: 2000489
  dataset_size: 1548305
---

# Dataset Card for SharcModified

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [More info needed]
- **Repository:** [github](https://github.com/nikhilweee/neural-conv-qa)
- **Paper:** [Neural Conversational QA: Learning to Reason v.s. Exploiting Patterns](https://arxiv.org/abs/1909.03759)
- **Leaderboard:** [More info needed]
- **Point of Contact:** [More info needed]

### Dataset Summary

ShARC, a conversational QA task, requires a system to answer user questions based on rules expressed in natural language text. However, the ShARC dataset contains multiple spurious patterns that neural models can exploit. SharcModified is a new dataset that reduces the patterns identified in the original dataset. To reduce the sensitivity of neural models to these patterns, for each occurrence of an instance conforming to any of the patterns, we automatically construct alternatives: we either replace the current instance with an alternative instance that does not exhibit the pattern, or retain the original instance. The modified ShARC comes in two versions: sharc-mod and history-shuffled.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The dataset is in English (en).

## Dataset Structure

### Data Instances

Example of one instance:

```
{
  "annotation": {
    "answer": [
      {
        "paragraph_reference": {
          "end": 64,
          "start": 35,
          "string": "syndactyly affecting the feet"
        },
        "sentence_reference": {
          "bridge": false,
          "end": 64,
          "start": 35,
          "string": "syndactyly affecting the feet"
        }
      }
    ],
    "explanation_type": "single_sentence",
    "referential_equalities": [
      {
        "question_reference": {
          "end": 40,
          "start": 29,
          "string": "webbed toes"
        },
        "sentence_reference": {
          "bridge": false,
          "end": 11,
          "start": 0,
          "string": "Webbed toes"
        }
      }
    ],
    "selected_sentence": {
      "end": 67,
      "start": 0,
      "string": "Webbed toes is the common name for syndactyly affecting the feet . "
    }
  },
  "example_id": 9174646170831578919,
  "original_nq_answers": [
    {
      "end": 45,
      "start": 35,
      "string": "syndactyly"
    }
  ],
  "paragraph_text": "Webbed toes is the common name for syndactyly affecting the feet . It is characterised by the fusion of two or more digits of the feet . This is normal in many birds , such as ducks ; amphibians , such as frogs ; and mammals , such as kangaroos . In humans it is considered unusual , occurring in approximately one in 2,000 to 2,500 live births .",
  "question": "what is the medical term for webbed toes",
  "sentence_starts": [0, 67, 137, 247],
  "title_text": "Webbed toes",
  "url": "https://en.wikipedia.org//w/index.php?title=Webbed_toes&oldid=801229780"
}
```

### Data Fields

- `example_id`: a unique integer identifier that matches up with NQ
- `title_text`: the title of the Wikipedia page containing the paragraph
- `url`: the URL of the Wikipedia page containing the paragraph
- `question`: a natural language question string from NQ
- `paragraph_text`: a paragraph string from a Wikipedia page containing the answer to the question
- `sentence_starts`: a list of integer character offsets indicating the start of sentences in the paragraph
- `original_nq_answers`: the original short answer spans from NQ
- `annotation`: the QED annotation, a dictionary with the following items:
  - `referential_equalities`: a list of dictionaries, one for each referential equality link annotated
  - `answer`: a list of dictionaries, one for each short answer span
  - `selected_sentence`: a dictionary representing the annotated sentence in the passage
  - `explanation_type`: one of "single_sentence", "multi_sentence", or "none"

### Data Splits

The dataset is split into training and validation splits.

|              | train | validation |
|--------------|------:|-----------:|
| N. Instances |  7638 |       1355 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Unknown.

### Citation Information

```
@misc{lamm2020qed,
    title={QED: A Framework and Dataset for Explanations in Question Answering},
    author={Matthew Lamm and Jennimaria Palomaki and Chris Alberti and Daniel Andor and Eunsol Choi and Livio Baldini Soares and Michael Collins},
    year={2020},
    eprint={2009.06354},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

### Contributions

Thanks to [@patil-suraj](https://github.com/patil-suraj) for adding this dataset.
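
The four configurations named in the card metadata (`mod`, `mod_dev_multi`, `history`, `history_dev_multi`) can presumably be loaded with the `datasets` library. The sketch below assumes the repository ID from the download link and that every configuration exposes a `validation` split, as the metadata indicates. The `format_example` helper and its text layout are hypothetical illustrations of how the metadata fields might be serialized for a conversational QA model; they are not the format used in the paper. The sample record is made up and is not taken from the dataset.

```python
from typing import Any, Dict

REPO_ID = "nikhilweee/sharc_modified"
CONFIGS = ("mod", "mod_dev_multi", "history", "history_dev_multi")


def load_config(config: str, split: str = "validation"):
    """Load one configuration from the Hub (needs `datasets` and network access)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    # Imported lazily so the rest of this sketch runs without the dependency.
    from datasets import load_dataset
    return load_dataset(REPO_ID, config, split=split)


def format_example(example: Dict[str, Any]) -> str:
    """Serialize rule text, scenario, question and dialogue history into one string.

    The layout is an illustrative assumption, not a documented input format.
    """
    lines = [
        f"snippet: {example['snippet']}",
        f"scenario: {example['scenario']}",
        f"question: {example['question']}",
    ]
    for turn in example["history"]:
        lines.append(f"follow-up: {turn['follow_up_question']}")
        lines.append(f"reply: {turn['follow_up_answer']}")
    return "\n".join(lines)


# Hypothetical record, invented for illustration; NOT taken from the dataset.
toy = {
    "snippet": "You can apply if you are over 18 and live in the UK.",
    "scenario": "I moved to London last year.",
    "question": "Can I apply?",
    "history": [
        {"follow_up_question": "Are you over 18?", "follow_up_answer": "Yes"},
    ],
}

print(format_example(toy))
```

Calling `load_config("mod", split="train")` would fetch the training split, which the metadata lists only for the `mod` and `history` configurations.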

Provider:

nikhilweee

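Summary of Original Information

Dataset Overview

Dataset name: SharcModified

Language: English (en)

License: Unknown

Multilinguality: Monolingual

Size category: 10K<n<100K

Source dataset: Extended from ShARC

Task category: Question answering

Task ID: Extractive QA (extractive-qa)

Tags: Conversational QA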

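Dataset Structure

Data Fields

id: string
utterance_id: string
source_url: string
snippet: string
question: string
scenario: string
history: list, each item containing:
- follow_up_question: string
- follow_up_answer: string
evidence: list, each item containing:
- follow_up_question: string
- follow_up_answer: string
answer: string
all_answers: sequence of strings (only in the mod_dev_multi and history_dev_multi configurations)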
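
The field list above can be written down as a typed record, which makes it easy to check a row before using it. The sketch below is a minimal illustration under the assumption that list-valued fields arrive as lists of dictionaries; the names `Turn`, `SharcRecord` and `schema_errors` are hypothetical and not part of the dataset or any library.

```python
from typing import Any, Dict, List, TypedDict


class Turn(TypedDict):
    follow_up_question: str
    follow_up_answer: str


class SharcRecord(TypedDict):
    id: str
    utterance_id: str
    source_url: str
    snippet: str
    question: str
    scenario: str
    history: List[Turn]
    evidence: List[Turn]
    answer: str


STRING_FIELDS = (
    "id", "utterance_id", "source_url", "snippet",
    "question", "scenario", "answer",
)
TURN_FIELDS = ("history", "evidence")
TURN_KEYS = ("follow_up_question", "follow_up_answer")


def schema_errors(record: Dict[str, Any]) -> List[str]:
    """Return human-readable schema violations (an empty list means the record fits)."""
    errors: List[str] = []
    for name in STRING_FIELDS:
        if not isinstance(record.get(name), str):
            errors.append(f"{name}: expected a string")
    for name in TURN_FIELDS:
        turns = record.get(name)
        if not isinstance(turns, list):
            errors.append(f"{name}: expected a list")
            continue
        for i, turn in enumerate(turns):
            for key in TURN_KEYS:
                if not (isinstance(turn, dict) and isinstance(turn.get(key), str)):
                    errors.append(f"{name}[{i}].{key}: expected a string")
    return errors


# Hypothetical record, invented for illustration; NOT taken from the dataset.
sample: SharcRecord = {
    "id": "example-0",
    "utterance_id": "utt-0",
    "source_url": "https://example.org/rules",
    "snippet": "You can apply if you are over 18.",
    "question": "Can I apply?",
    "scenario": "",
    "history": [{"follow_up_question": "Are you over 18?", "follow_up_answer": "Yes"}],
    "evidence": [],
    "answer": "Yes",
}

print(schema_errors(sample))
```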

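Data Splits

Configuration: mod
- Training set:
  - Bytes: 15138034
  - Examples: 21890
- Validation set:
  - Bytes: 1474239
  - Examples: 2270
- Download size (bytes): 21197271
- Dataset size (bytes): 16612273
Configuration: mod_dev_multi
- Validation set:
  - Bytes: 1553940
  - Examples: 2270
- Download size (bytes): 2006124
- Dataset size (bytes): 1553940
Configuration: history
- Training set:
  - Bytes: 15083103
  - Examples: 21890
- Validation set:
  - Bytes: 1468604
  - Examples: 2270
- Download size (bytes): 21136658
- Dataset size (bytes): 16551707
Configuration: history_dev_multi
- Validation set:
  - Bytes: 1548305
  - Examples: 2270
- Download size (bytes): 2000489
- Dataset size (bytes): 1548305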
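
The split figures above can be cross-checked with a little arithmetic. The sketch below assumes that each configuration's dataset size is the sum of its per-split byte counts, which the listed numbers are consistent with, and computes the share of examples that fall in the validation split. The constant and function names are hypothetical.

```python
# (num_bytes, num_examples) per split, copied from the figures listed above.
SPLITS = {
    "mod": {"train": (15138034, 21890), "validation": (1474239, 2270)},
    "mod_dev_multi": {"validation": (1553940, 2270)},
    "history": {"train": (15083103, 21890), "validation": (1468604, 2270)},
    "history_dev_multi": {"validation": (1548305, 2270)},
}

# Reported dataset_size per configuration, also from the figures above.
DATASET_SIZE = {
    "mod": 16612273,
    "mod_dev_multi": 1553940,
    "history": 16551707,
    "history_dev_multi": 1548305,
}


def total_bytes(config: str) -> int:
    """Sum the byte counts of all splits in one configuration."""
    return sum(num_bytes for num_bytes, _ in SPLITS[config].values())


def validation_share(config: str) -> float:
    """Fraction of a configuration's examples that belong to the validation split."""
    counts = {split: n for split, (_, n) in SPLITS[config].items()}
    return counts["validation"] / sum(counts.values())


for config in SPLITS:
    # Assumption being checked: dataset_size equals the sum of split sizes.
    assert total_bytes(config) == DATASET_SIZE[config]
    print(config, total_bytes(config), round(validation_share(config), 3))
```

For the `mod` and `history` configurations this gives a validation share of 2270 / 24160, a little over 9% of the examples.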
