
allegro/klej-cdsc-e

Source: Hugging Face. Updated 2022-08-30; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/allegro/klej-cdsc-e
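Resource summary:
The Polish CDSCorpus consists of 10K Polish sentence pairs that are human-annotated for semantic relatedness (CDSC-R) and entailment (CDSC-E). The dataset is intended for evaluating compositional distributional semantics models of Polish and was presented at ACL 2017. Its design is inspired by the SICK corpus but differs in the details. As in SICK, the sentences come from image captions, but the set of chosen images is more diverse because the images are drawn from 46 thematic groups.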
Provider:
allegro
Summary of original information

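Dataset overview

Name: CDSC-E

Language: Polish (pl)

License: CC BY-NC-SA 4.0

Multilinguality: monolingual

Size category: 10K<n<100K

Source: original data

Task category: text classification

Task ID: natural language inference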

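Description

CDSC-E is the entailment task of the Polish CDSCorpus, which contains 10,000 Polish sentence pairs human-annotated for semantic relatedness (CDSC-R) and entailment (CDSC-E). The dataset is intended for evaluating compositional distributional semantics models of Polish and was presented at ACL 2017.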

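Task details

Input: a pair of sentences (sentence_A, sentence_B)

Output: an entailment judgment (the entailment_judgment column) with three possible relations: entailment, contradiction, neutral

Domain: image captions

Metric: accuracy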
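
The accuracy metric named above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's official scorer: the upper-case label spelling is assumed from the class distribution table, comparison is made case-insensitive to hedge against that assumption, and the toy labels are hypothetical.

```python
from typing import Sequence

# Label set from the task description. The upper-case spelling follows the
# class distribution table and is an assumption about the stored values.
LABELS = ("ENTAILMENT", "CONTRADICTION", "NEUTRAL")


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of sentence pairs whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    for label in list(gold) + list(predicted):
        if label.upper() not in LABELS:
            raise ValueError(f"unexpected label: {label!r}")
    # Compare case-insensitively so "neutral" and "NEUTRAL" count as equal.
    correct = sum(g.upper() == p.upper() for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical toy labels, not taken from the dataset.
gold = ["NEUTRAL", "ENTAILMENT", "NEUTRAL", "CONTRADICTION"]
predicted = ["NEUTRAL", "NEUTRAL", "NEUTRAL", "CONTRADICTION"]
print(accuracy(gold, predicted))  # 3 of 4 correct
```

Because roughly three quarters of the pairs are NEUTRAL, accuracy alone can look high for a model that rarely predicts the other two classes, so per-class figures are worth reporting alongside it.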

Data splits

Subset        Cardinality
train         8000
validation    1000
test          1000
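
One way to confirm these split sizes after downloading is sketched below. It assumes the third-party Hugging Face `datasets` package, that the repository id `allegro/klej-cdsc-e` loads without extra arguments, and that the splits are named `train`, `validation` and `test`. None of these assumptions is confirmed by this page. The helper names are hypothetical.

```python
# Split sizes as listed in the table above.
EXPECTED_SIZES = {"train": 8000, "validation": 1000, "test": 1000}


def size_mismatches(actual: dict) -> dict:
    """Return {split: (expected, actual)} for every split whose size differs."""
    return {
        split: (expected, actual.get(split))
        for split, expected in EXPECTED_SIZES.items()
        if actual.get(split) != expected
    }


def load_split_sizes(repo_id: str = "allegro/klej-cdsc-e") -> dict:
    """Download the dataset and return the number of rows per split.

    Needs the `datasets` package and network access; the split names it
    returns are whatever the repository defines (assumed, not verified).
    """
    from datasets import load_dataset  # lazy import keeps the check below offline

    dataset = load_dataset(repo_id)
    return {name: split.num_rows for name, split in dataset.items()}


# Offline check against the documented sizes; nothing is downloaded here.
print(size_mismatches({"train": 8000, "validation": 1000, "test": 1000}))
```

With network access, `size_mismatches(load_split_sizes())` would return an empty dictionary if the downloaded copy matches the table.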

Class distribution

Class            train    validation    test
NEUTRAL          0.744    0.741         0.744
ENTAILMENT       0.179    0.185         0.190
CONTRADICTION    0.077    0.074         0.066
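
Since accuracy is the metric, the proportions above imply a floor for any useful model: always predicting the most frequent class. The sketch below derives that majority-class baseline and approximate per-class counts from the published figures. The counts are estimates, because the proportions are rounded to three decimals, and the function names are hypothetical.

```python
# Class proportions and split sizes copied from the tables above.
DISTRIBUTION = {
    "train": {"NEUTRAL": 0.744, "ENTAILMENT": 0.179, "CONTRADICTION": 0.077},
    "validation": {"NEUTRAL": 0.741, "ENTAILMENT": 0.185, "CONTRADICTION": 0.074},
    "test": {"NEUTRAL": 0.744, "ENTAILMENT": 0.190, "CONTRADICTION": 0.066},
}
SPLIT_SIZES = {"train": 8000, "validation": 1000, "test": 1000}


def majority_baseline(split: str) -> tuple:
    """Return (label, accuracy) for always predicting the most frequent class."""
    shares = DISTRIBUTION[split]
    label = max(shares, key=shares.get)
    return label, shares[label]


def approximate_counts(split: str) -> dict:
    """Estimate per-class row counts from the rounded proportions."""
    return {
        label: round(share * SPLIT_SIZES[split])
        for label, share in DISTRIBUTION[split].items()
    }


for split in DISTRIBUTION:
    print(split, majority_baseline(split), approximate_counts(split))
```

On the test split this gives an accuracy of 0.744 for a classifier that outputs NEUTRAL for every pair, so reported scores are best read relative to that number.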

Citation

@inproceedings{wroblewska-krasnowska-kieras-2017-polish,
    title = "{P}olish evaluation dataset for compositional distributional semantics models",
    author = "Wr{\'o}blewska, Alina and Krasnowska-Kiera{\'s}, Katarzyna",
    booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P17-1073",
    doi = "10.18653/v1/P17-1073",
    pages = "784--792",
    abstract = "The paper presents a procedure of building an evaluation dataset for the validation of compositional distributional semantics models estimated for languages other than English. The procedure generally builds on steps designed to assemble the SICK corpus, which contains pairs of English sentences annotated for semantic relatedness and entailment, because we aim at building a comparable dataset. However, the implementation of particular building steps significantly differs from the original SICK design assumptions, which is caused by both lack of necessary extraneous resources for an investigated language and the need for language-specific transformation rules. The designed procedure is verified on Polish, a fusional language with a relatively free word order, and contributes to building a Polish evaluation dataset. The resource consists of 10K sentence pairs which are human-annotated for semantic relatedness and entailment. The dataset may be used for the evaluation of compositional distributional semantics models of Polish.",
}
