COSTRA 1.0

Name: COSTRA 1.0
Creator: Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University
Published: 2020-04-16 15:32:00
License: No description available

Source: arXiv. Updated: 2020-04-16. Indexed: 2024-06-21.

Download link:

http://hdl.handle.net/11234/1-3123

Resource description:

COSTRA 1.0 is a dataset of complex sentence transformations created by the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University. It is intended for studying how well sentence-level embeddings capture deeper semantic and syntactic relations. The dataset contains 4,262 unique Czech sentences with an average length of 10 words, covering 15 types of modification such as simplification, generalization, and formal versus informal language variation. It was built in two rounds of annotation: the first collected ideas for the modifications, and the second collected data based on those ideas. Intended uses include testing the semantic properties of sentence embeddings, exploring the topology of the sentence embedding space, and finding clear, 'orthogonal' relations between sentences.
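As a rough illustration of what looking for 'orthogonal' relations could mean in practice, the sketch below averages the embedding offset (transformed sentence minus seed sentence) for each modification type and compares the mean offsets with cosine similarity. The vectors are made-up three-dimensional toy values, and the helper names (cosine, offset, mean_vector) are hypothetical; nothing here reflects the actual COSTRA file format or any particular embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def offset(seed, transformed):
    """Offset vector from a seed embedding to its transformed counterpart."""
    return [t - s for s, t in zip(seed, transformed)]


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy (seed, transformed) embedding pairs per modification type.
# ASSUMPTION: these numbers are invented for illustration only.
pairs = {
    "simplification": [
        ([1.0, 0.0, 0.0], [1.0, 1.0, 0.0]),
        ([0.0, 2.0, 1.0], [0.0, 3.0, 1.0]),
    ],
    "generalization": [
        ([1.0, 0.0, 0.0], [1.0, 0.0, 1.0]),
        ([0.0, 2.0, 1.0], [0.0, 2.0, 2.0]),
    ],
}

# Mean offset per modification type.
mean_offsets = {
    name: mean_vector([offset(seed, transformed) for seed, transformed in examples])
    for name, examples in pairs.items()
}

# A value near 0 suggests the two modification types move sentences
# in roughly orthogonal directions of the embedding space.
similarity = cosine(mean_offsets["simplification"], mean_offsets["generalization"])
print(f"cosine between mean offsets: {similarity:.3f}")
```

With real data, the toy vectors would be replaced by embeddings of the seed and modified sentences from whichever sentence encoder is being evaluated.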

Provider:

Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University

Created:

2019-12-04
