erhwenkuo/squad-cmrc2018-zhtw

Name: erhwenkuo/squad-cmrc2018-zhtw
Creator: erhwenkuo
Published: 2023-10-15 04:52:32
License: no description provided

Source: Hugging Face. Updated 2023-10-15; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/erhwenkuo/squad-cmrc2018-zhtw

Resource summary:

This is a Chinese machine reading comprehension dataset based on CMRC 2018. It targets span extraction and is intended to add language diversity to that area. It consists of nearly 20,000 real questions annotated by human experts on Wikipedia paragraphs, and it also includes a challenge set whose questions require understanding the whole context and reasoning across multiple sentences. The dataset has three splits: training, validation, and test, with 10,142, 3,219, and 1,002 samples respectively. Each sample has four fields: id, context, question, and answers, where answers contains two sub-fields, text and answer_start.

Provider:

erhwenkuo

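Original information summary

Dataset overview

Dataset summary

CMRC 2018 is the dataset used in the competition held at the Second "iFLYTEK Cup" Evaluation Workshop on Chinese Machine Reading Comprehension (CMRC 2018).

It is a span-extraction dataset for Chinese machine reading comprehension, intended to add language diversity to this area. The dataset consists of nearly 20,000 real questions annotated by human experts on Wikipedia paragraphs.

It also includes an annotated challenge set containing questions that require comprehensive understanding of the whole context and multi-sentence reasoning.

Dataset structure

Data file configuration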

train: data/train-*
validation: data/validation-*
test: data/test-*
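
The configuration above maps each split name to a glob pattern over the repository's data files. The following is a minimal sketch of how that mapping might be used, assuming the Hugging Face `datasets` package is installed and that the repository id matches the one in the download link. The shard file name in the last line is hypothetical and serves only to exercise the pattern matching; `split_for_path` and `load_split` are illustrative helpers, not part of any library.

```python
from fnmatch import fnmatch
from typing import Optional

# Repository id taken from the download link on this page.
DATASET_ID = "erhwenkuo/squad-cmrc2018-zhtw"

# Split-to-pattern mapping copied from the data file configuration.
DATA_FILES = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}


def split_for_path(path: str) -> Optional[str]:
    """Return the split whose glob pattern matches `path`, or None."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


def load_split(split: str):
    """Load one split from the Hub (needs network access and `pip install datasets`)."""
    if split not in DATA_FILES:
        raise ValueError(f"unknown split: {split!r}")
    # Imported lazily so the offline helper above works without the package.
    from datasets import load_dataset
    return load_dataset(DATASET_ID, split=split)


# Hypothetical shard name, used only to illustrate the matching.
print(split_for_path("data/train-00000-of-00001.parquet"))
```

Calling `load_split("train")` would download the data on first use, so it is left out of the runnable part of the sketch.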

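Data features

id: string, the record identifier
context: string, the passage that the question is about
question: string, the question
answers: the answers to the question (extracted from the context)
- text: list of strings, the answer texts
- answer_start: list of integers, the position of each answer within context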
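
The field layout above can be sketched as a pair of typed dictionaries, together with a check that each answer text really appears at its stated position. This assumes that answer_start is a zero-based character offset into context, which the page does not state explicitly. The record below is made up for illustration and is not taken from the dataset; `misaligned_answers` is a hypothetical helper name.

```python
from typing import List, TypedDict


class Answers(TypedDict):
    text: List[str]
    answer_start: List[int]


class Sample(TypedDict):
    id: str
    context: str
    question: str
    answers: Answers


def misaligned_answers(sample: Sample) -> List[int]:
    """Return the indices of answers whose text is not found at answer_start.

    Assumes answer_start is a zero-based character offset into context.
    """
    context = sample["context"]
    answers = sample["answers"]
    bad = []
    for i, (text, start) in enumerate(zip(answers["text"], answers["answer_start"])):
        if context[start:start + len(text)] != text:
            bad.append(i)
    return bad


# Made-up record for illustration only; it is not taken from the dataset.
example: Sample = {
    "id": "EXAMPLE_0",
    "context": "貓是一種常見的寵物，喜歡在白天睡覺。",
    "question": "貓喜歡在什麼時候睡覺？",
    "answers": {"text": ["白天"], "answer_start": [13]},
}

print(misaligned_answers(example))
```

An empty result means every answer span lines up with the context. Running the same check over a loaded split is one way to test the offset assumption before training a span-extraction model.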

Data splits

train: 10,142 records
validation: 3,219 records
test: 1,002 records
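
The three split counts sum to 14,363 records. A short sketch of the arithmetic, using only the numbers listed above, shows each split's share of that total:

```python
# Record counts copied from the split listing.
SPLIT_SIZES = {"train": 10_142, "validation": 3_219, "test": 1_002}

total = sum(SPLIT_SIZES.values())
shares = {name: count / total for name, count in SPLIT_SIZES.items()}

# Print one row per split: name, record count, share of the total.
for name, count in SPLIT_SIZES.items():
    print(f"{name:<10} {count:>6,} {shares[name]:6.1%}")
print(f"{'total':<10} {total:>6,}")
```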

Dataset size

Download size: 4781898 bytes
Dataset size: 21350661 bytes
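
For readability, the byte counts can be converted to mebibytes: roughly 4.56 MiB to download and 20.36 MiB for the full dataset. A minimal sketch of the conversion, where `to_mib` is an illustrative helper:

```python
# Byte counts copied from the size listing.
DOWNLOAD_BYTES = 4_781_898
DATASET_BYTES = 21_350_661
MIB = 1024 ** 2  # bytes per mebibyte


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB


print(f"download: {to_mib(DOWNLOAD_BYTES):.2f} MiB")
print(f"dataset:  {to_mib(DATASET_BYTES):.2f} MiB")
print(f"ratio:    {DATASET_BYTES / DOWNLOAD_BYTES:.2f}x")
```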

License information

CC BY-SA 4.0

Citation

@inproceedings{cui-emnlp2019-cmrc2018,
    title = "A Span-Extraction Dataset for {C}hinese Machine Reading Comprehension",
    author = "Cui, Yiming and Liu, Ting and Che, Wanxiang and Xiao, Li and Chen, Zhipeng and Ma, Wentao and Wang, Shijin and Hu, Guoping",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1600",
    doi = "10.18653/v1/D19-1600",
    pages = "5886--5891",
}

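Collected summary

Dataset introduction

Background and challenges

Background overview

erhwenkuo/squad-cmrc2018-zhtw is a Chinese machine reading comprehension dataset containing nearly 20,000 real questions for span extraction. It is divided into training, validation, and test sets, and each record provides a context, a question, and answers. It is suited to natural language processing research and applications.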

The content above was collected and summarized by 遇见数据集.
