Czech News Dataset for Semantic Textual Similarity

Name: Czech News Dataset for Semantic Textual Similarity
Creator: Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia
Published: 2022-01-21 18:28:54
License: No description available

Source: arXiv. Updated: 2022-01-21. Indexed: 2024-06-21.

Download link:

https://air.kiv.zcu.cz/datasets/sts-ctk

Resource description:

The Czech News Dataset for Semantic Textual Similarity is a large-scale dataset created by the Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia. It contains 138,556 sentence pairs drawn from the Czech news domain, and each pair carries two semantic similarity annotations: one made with context and one made without. The dataset is intended for training and evaluating systems that predict the semantic similarity of sentences. 485 journalism students took part in its creation, and the reliability of the test set was improved by averaging annotation scores. The dataset is particularly suited to studying how context affects semantic similarity judgments, and it has been used to train high-performing similarity prediction models.
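The score averaging mentioned in the description can be illustrated with a minimal sketch. The record layout, the setting names, and the score values below are assumptions made up for illustration; they are not the dataset's actual file format or annotation scale.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical annotation records: (pair_id, setting, score).
# Identifiers, setting names, and scores are invented for this sketch.
annotations = [
    ("pair-1", "with_context", 4),
    ("pair-1", "with_context", 5),
    ("pair-1", "without_context", 3),
    ("pair-1", "without_context", 4),
    ("pair-2", "with_context", 1),
    ("pair-2", "without_context", 2),
]


def average_scores(records):
    """Average all scores that share the same sentence pair and setting."""
    grouped = defaultdict(list)
    for pair_id, setting, score in records:
        grouped[(pair_id, setting)].append(score)
    # One averaged score per (pair, setting) combination.
    return {key: mean(scores) for key, scores in grouped.items()}


averaged = average_scores(annotations)
for (pair_id, setting), score in sorted(averaged.items()):
    print(pair_id, setting, score)
```

Keeping the with-context and without-context scores in separate groups preserves the distinction the dataset is built around, so the two averages for a pair can be compared directly.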

Provided by:

Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia

Created:

2021-08-19