Short-Span Extraction, Long-Span Extraction, Short Cloze, and Long Cloze Task Datasets

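Name: Short-Span Extraction, Long-Span Extraction, Short Cloze, and Long Cloze Task Datasets
Creator: Chengdu Institute of Computer Applications, Chinese Academy of Sciences
Published: 2021-09-29 12:07:05
License: Not specified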

Source: arXiv. Updated: 2021-09-29. Indexed: 2024-08-06.

Download link:

http://arxiv.org/abs/2110.15712v1

Description:

This study constructs four Chinese machine reading comprehension (MRC) datasets, one for each of four tasks: short-span extraction, long-span extraction, short cloze, and long cloze. The datasets are derived from the Chinese SQuAD and NLPCC 2017 corpora, with data quality ensured through machine translation followed by manual correction. Each dataset is split by answer length to match its task. Unified pre-training parameters and masking strategies were used throughout construction to keep the datasets consistent and comparable. The datasets are intended primarily for evaluating how different masking lengths affect the performance of masked language models (MLMs) on MRC tasks, with the aim of improving pre-training strategies and machine reading comprehension.
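
The split by answer length described above can be illustrated with a minimal sketch. It is not the authors' procedure: the field names (`question`, `answer`), the function name, and the 4-character cutoff are all hypothetical, since this listing does not state the actual schema or threshold.

```python
from typing import Dict, List

# Hypothetical cutoff in characters; the listing does not give the real value.
SHORT_ANSWER_MAX_CHARS = 4


def split_by_answer_length(
    examples: List[Dict[str, str]],
    max_short_chars: int = SHORT_ANSWER_MAX_CHARS,
) -> Dict[str, List[Dict[str, str]]]:
    """Partition MRC examples into 'short' and 'long' buckets by answer length."""
    buckets: Dict[str, List[Dict[str, str]]] = {"short": [], "long": []}
    for example in examples:
        # Chinese text is measured in characters rather than whitespace tokens.
        key = "short" if len(example["answer"]) <= max_short_chars else "long"
        buckets[key].append(example)
    return buckets


# Illustrative records only; they are not taken from the datasets.
examples = [
    {"question": "研究所位于哪个城市？", "answer": "成都"},
    {"question": "数据集由哪个机构创建？", "answer": "中国科学院成都计算机应用研究所"},
]
buckets = split_by_answer_length(examples)
print(len(buckets["short"]), len(buckets["long"]))
```

The same bucketing would apply to both the span-extraction and the cloze variants, with only the cutoff and the record schema changing.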

Provided by:

Chengdu Institute of Computer Applications, Chinese Academy of Sciences

Created:

2021-09-29