CORD19STS

Name: CORD19STS
Creator: Information Sciences Institute, University of Southern California
Published: 2020-11-03 03:28:15
License: Not specified

Source: arXiv. Updated: 2020-11-03. Indexed: 2024-08-06.

Download link:

http://arxiv.org/abs/2007.02461v2


Resource description:

CORD19STS is a semantic textual similarity (STS) dataset for the COVID-19 domain, created by the Information Sciences Institute of the University of Southern California. It contains 13,710 annotated sentence pairs collected from the COVID-19 Open Research Dataset (CORD-19) Challenge. The dataset was built in two stages, data collection and annotation. First, 1 million sentence pairs were drawn from CORD-19 using several sampling strategies and scored for similarity with the Sen-SCI-CORD19-BERT language model; the scored pairs were then filtered down to 32,000. Each sentence pair was annotated by five Amazon Mechanical Turk (AMT) crowdworkers, with labels representing different levels of semantic similarity between the two sentences. CORD19STS is intended as a new data resource for semantic textual similarity tasks in the COVID-19 domain, supporting natural language processing applications such as conversational medical diagnosis systems and information retrieval engines.
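The construction process described above (score candidate pairs with a sentence-embedding model, filter by score, then combine five crowdworker labels per pair) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the dataset's documentation: the toy two-dimensional vectors stand in for Sen-SCI-CORD19-BERT embeddings, the 0.5 score threshold is a hypothetical filter rule, and majority voting with ties resolved to the lower label is only one possible way to aggregate annotations.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_u * norm_v)


def aggregate_labels(labels):
    """Majority vote over annotator labels.

    Assumption: ties are resolved in favour of the lower label.
    """
    counts = Counter(labels)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)


# Hypothetical embeddings standing in for the output of a sentence encoder.
toy_embeddings = {
    "s1": [1.0, 0.0],
    "s2": [1.0, 0.0],
    "s3": [0.0, 1.0],
}
candidate_pairs = [("s1", "s2"), ("s1", "s3")]

# Score every candidate pair, then keep those above a hypothetical threshold.
scored = [
    (pair, cosine_similarity(toy_embeddings[pair[0]], toy_embeddings[pair[1]]))
    for pair in candidate_pairs
]
kept = [pair for pair, score in scored if score >= 0.5]

# Combine five (made-up) crowdworker labels for one retained pair.
final_label = aggregate_labels([0, 1, 1, 1, 2])
```

In this toy run only the pair ("s1", "s2") passes the threshold, and the five labels aggregate to 1. A real pipeline would substitute actual model embeddings and whatever filtering and aggregation rules the dataset authors used.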

Provider:

Information Sciences Institute, University of Southern California

Created:

2020-07-06
