maximoss/sick_el-gr_mt

Name: maximoss/sick_el-gr_mt
Creator: maximoss
Published: 2024-05-18 17:25:51
License: not specified

Source: Hugging Face. Updated: 2024-05-18. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/maximoss/sick_el-gr_mt

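Resource summary:

This dataset is a Modern Greek machine-translated version of the SICK (Sentences Involving Compositional Knowledge) dataset, intended for natural language inference (NLI), also known as textual entailment. NLI is a sentence-pair classification task: given sentences A and B, predict whether A entails B, contradicts B, or is neutral with respect to B. Only the sentence pairs were machine translated; all other information (pair IDs, labels, the source dataset of each sentence, and the train/validation/test split) is identical to the original English dataset. The data is stored as TSV in a layout similar to the widely used XNLI dataset. It is also compatible with the French version of SICK, so the English, Greek, and French versions can be combined for multilingual NLI.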

Provider:

maximoss

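Summary of the original dataset card

Dataset Card for maximoss/sick_el-gr_mt

Dataset Description

Dataset Summary

This repository contains a Modern Greek machine-translated version of the SICK (Sentences Involving Compositional Knowledge) dataset. The goal is to predict textual entailment (does sentence A entail, contradict, or neither entail nor contradict sentence B?), which is a classification task: given two sentences, predict one of three labels. Only the sentence pairs were machine translated; the remaining information (pair ID, labels, source dataset of each sentence, and the train/dev/test partition) is identical to the original English dataset.

The dataset is formatted as TSV, similar to the widely used XNLI dataset, for ease of use. It is also compatible with the French version of SICK: because both versions keep the same column names as the English original, the three can be used together for trilingual NLI (English, Greek, French).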
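
Because the language versions share column names, rows from each can be read the same way and concatenated. The following is a minimal sketch using only the Python standard library. The column order, the `language` tag, the helper names `read_tsv` and `combine`, and the one-row samples with placeholder sentences are all assumptions made for illustration; they are not taken from the actual files.

```python
import csv
import io

# Column names as listed in the dataset card; their order in the real files is assumed.
COLUMNS = [
    "pair_ID", "sentence_A", "sentence_B", "entailment_label",
    "entailment_AB", "entailment_BA",
    "original_SICK_sentence_A", "original_SICK_sentence_B",
    "sentence_A_dataset", "sentence_B_dataset",
]

def read_tsv(text, language):
    """Parse TSV text into a list of dicts and tag each row with a language code."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [dict(row, language=language) for row in reader]

def combine(*datasets):
    """Concatenate row lists from several languages; relies on identical column names."""
    return [row for rows in datasets for row in rows]

# Hypothetical one-row samples with placeholder sentences (not real dataset rows).
header = "\t".join(COLUMNS)
row = "\t".join(["1", "<A>", "<B>", "NEUTRAL", "A_neutral_B", "B_neutral_A",
                 "<original A>", "<original B>", "FLICKR", "FLICKR"])
sample_el = f"{header}\n{row}\n"
sample_fr = f"{header}\n{row}\n"

combined = combine(read_tsv(sample_el, "el"), read_tsv(sample_fr, "fr"))
print(len(combined), sorted({r["language"] for r in combined}))
```

In practice the text passed to `read_tsv` would come from the downloaded TSV files rather than from in-memory strings.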

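Supported Tasks and Leaderboards

The dataset can be used for natural language inference (NLI), also known as recognizing textual entailment (RTE), which is a sentence-pair classification task.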

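Dataset Structure

Data Fields

pair_ID: Sentence pair ID.
sentence_A: Sentence A, called the premise in other NLI datasets.
sentence_B: Sentence B, called the hypothesis in other NLI datasets.
entailment_label: Gold label for textual entailment (NEUTRAL, ENTAILMENT, or CONTRADICTION).
entailment_AB: Entailment label in the A-B direction (A_neutral_B, A_entails_B, or A_contradicts_B).
entailment_BA: Entailment label in the B-A direction (B_neutral_A, B_entails_A, or B_contradicts_A).
original_SICK_sentence_A: The original premise from the English source dataset.
original_SICK_sentence_B: The original hypothesis from the English source dataset.
sentence_A_dataset: The dataset from which the original sentence A was extracted (FLICKR or SEMEVAL).
sentence_B_dataset: The dataset from which the original sentence B was extracted (FLICKR or SEMEVAL).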
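
For model training, the string gold label is usually mapped to an integer class id. The sketch below shows one way to turn a row into a premise/hypothesis example. The integer ids, the `NLIExample` class, and the `to_example` helper are arbitrary choices for this illustration and are not defined by the dataset, and the sample row uses placeholder sentences.

```python
from dataclasses import dataclass

# Integer ids are an arbitrary choice for this sketch, not defined by the dataset.
LABEL_TO_ID = {"ENTAILMENT": 0, "NEUTRAL": 1, "CONTRADICTION": 2}

@dataclass(frozen=True)
class NLIExample:
    pair_id: str
    premise: str      # taken from sentence_A
    hypothesis: str   # taken from sentence_B
    label: int

def to_example(row):
    """Convert one row (a dict keyed by the dataset's column names) to an NLIExample."""
    gold = row["entailment_label"].strip().upper()
    if gold not in LABEL_TO_ID:
        raise ValueError(f"unexpected label: {row['entailment_label']!r}")
    return NLIExample(row["pair_ID"], row["sentence_A"], row["sentence_B"], LABEL_TO_ID[gold])

# Hypothetical row with placeholder sentences (not a real dataset row).
example = to_example({
    "pair_ID": "1",
    "sentence_A": "<premise in Greek>",
    "sentence_B": "<hypothesis in Greek>",
    "entailment_label": "CONTRADICTION",
})
print(example)
```

Rejecting unknown labels early makes a malformed or shifted TSV row fail loudly instead of silently producing a wrong class id.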

Data Splits

name	Entailment	Neutral	Contradiction	Total
train	1274	2524	641	4439
validation	143	281	71	495
test	1404	2790	712	4906
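
The gold-label counts show that the classes are imbalanced, with Neutral the most frequent label in every split. As a rough reference point, the sketch below computes the label shares and the accuracy of a majority-class baseline (always predicting the most frequent training label) on the test split. The counts are copied from the table above; the function names are illustrative.

```python
# Gold-label counts copied from the split table.
SPLITS = {
    "train": {"ENTAILMENT": 1274, "NEUTRAL": 2524, "CONTRADICTION": 641},
    "validation": {"ENTAILMENT": 143, "NEUTRAL": 281, "CONTRADICTION": 71},
    "test": {"ENTAILMENT": 1404, "NEUTRAL": 2790, "CONTRADICTION": 712},
}

def label_shares(counts):
    """Fraction of examples carrying each label."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def majority_baseline(train_counts, eval_counts):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority = max(train_counts, key=train_counts.get)
    return majority, eval_counts[majority] / sum(eval_counts.values())

majority_label, test_accuracy = majority_baseline(SPLITS["train"], SPLITS["test"])
print(majority_label, round(test_accuracy, 4))
print({label: round(share, 3) for label, share in label_shares(SPLITS["train"]).items()})
```

With these counts the baseline predicts Neutral and is correct on about 57% of the test pairs, so a trained model should be judged against that figure rather than against chance.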

For the A-B direction:

name	A_entails_B	A_neutral_B	A_contradicts_B
train	1274	2381	784
validation	143	266	86
test	1404	2621	881

For the B-A direction:

name	B_entails_A	B_neutral_A	B_contradicts_A
train	606	3072	761
validation	84	329	82
test	610	3431	865
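
A quick sanity check on the three tables is to confirm that every row adds up to the same split total and to compare the directional counts with the gold counts. The sketch below does this with the numbers copied from the tables; the variable and function names are illustrative.

```python
# Counts copied from the three split tables, ordered (entails, neutral, contradicts).
TOTALS = {"train": 4439, "validation": 495, "test": 4906}
GOLD = {"train": (1274, 2524, 641), "validation": (143, 281, 71), "test": (1404, 2790, 712)}
A_TO_B = {"train": (1274, 2381, 784), "validation": (143, 266, 86), "test": (1404, 2621, 881)}
B_TO_A = {"train": (606, 3072, 761), "validation": (84, 329, 82), "test": (610, 3431, 865)}

def rows_match_totals(table):
    """True for each split whose three counts add up to the split's total."""
    return {split: sum(counts) == TOTALS[split] for split, counts in table.items()}

def column_difference(left, right, index):
    """Per-split difference (left minus right) for one label column."""
    return {split: left[split][index] - right[split][index] for split in left}

for name, table in [("gold", GOLD), ("A-B", A_TO_B), ("B-A", B_TO_A)]:
    print(name, rows_match_totals(table))

# Index 0 is the entailment column, index 2 the contradiction column.
print(column_difference(A_TO_B, GOLD, 0))
print(column_difference(A_TO_B, GOLD, 2))
```

All rows are consistent with the split totals. The A-B entailment counts coincide with the gold entailment counts in every split, while the A-B contradiction counts are higher than the gold ones; the card does not explain this difference, so the directional labels are best treated as separate annotations rather than as a restatement of the gold label.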

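Dataset Creation

The dataset was machine translated from English to Modern Greek using the latest opus-mt-tc-big neural machine translation model available for Modern Greek at the time. The sentences were translated on November 26, 2023.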
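
The translation step can be pictured as replacing the two sentence columns while leaving every other field untouched and keeping the English text in the `original_SICK_sentence_*` columns. The sketch below is not the author's pipeline: `translate_rows` is a hypothetical helper, the `translate` argument is an assumed interface (a callable from a list of strings to a list of strings), and `fake_translate` is a stand-in that only prefixes the text so the example runs without a model. The English sample pair and its label are made up for illustration.

```python
def translate_rows(rows, translate):
    """Return new rows with sentence_A/B translated and the English text preserved.

    `translate` is any callable mapping a list of English strings to a list of
    translated strings of the same length (assumed interface).
    """
    rows = list(rows)
    translated_a = translate([r["sentence_A"] for r in rows])
    translated_b = translate([r["sentence_B"] for r in rows])
    if len(translated_a) != len(rows) or len(translated_b) != len(rows):
        raise ValueError("translator must return one output per input")
    out = []
    for row, a, b in zip(rows, translated_a, translated_b):
        new_row = dict(row)  # ids, labels, and source-dataset fields stay untouched
        new_row["original_SICK_sentence_A"] = row["sentence_A"]
        new_row["original_SICK_sentence_B"] = row["sentence_B"]
        new_row["sentence_A"] = a
        new_row["sentence_B"] = b
        out.append(new_row)
    return out

def fake_translate(sentences):
    """Stand-in translator for demonstration only; it does not translate anything."""
    return [f"[el] {s}" for s in sentences]

# Made-up English pair used only to exercise the function.
english = [{
    "pair_ID": "1",
    "sentence_A": "A dog is running",
    "sentence_B": "An animal is moving",
    "entailment_label": "ENTAILMENT",
}]
translated = translate_rows(english, fake_translate)
print(translated[0])
```

To reproduce something closer to the described setup, `fake_translate` would be replaced by a function that wraps an actual English-to-Greek machine translation model.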

Additional Information

Citation Information

BibTeX:

@inproceedings{skandalis-etal-2024-new-datasets,
    title = "New Datasets for Automatic Detection of Textual Entailment and of Contradictions between Sentences in {F}rench",
    author = "Skandalis, Maximos and Moot, Richard and Retor{\'e}, Christian and Robillard, Simon",
    editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1065",
    pages = "12173--12186",
    abstract = "This paper introduces DACCORD, an original dataset in French for automatic detection of contradictions between sentences. It also presents new, manually translated versions of two datasets, namely the well known dataset RTE3 and the recent dataset GQNLI, from English to French, for the task of natural language inference / recognising textual entailment, which is a sentence-pair classification task. These datasets help increase the admittedly limited number of datasets in French available for these tasks. DACCORD consists of 1034 pairs of sentences and is the first dataset exclusively dedicated to this task and covering among others the topic of the Russian invasion in Ukraine. RTE3-FR contains 800 examples for each of its validation and test subsets, while GQNLI-FR is composed of 300 pairs of sentences and focuses specifically on the use of generalised quantifiers. Our experiments on these datasets show that they are more challenging than the two already existing datasets for the mainstream NLI task in French (XNLI, FraCaS). For languages other than English, most deep learning models for NLI tasks currently have only XNLI available as a training set. Additional datasets, such as ours for French, could permit different training and evaluation strategies, producing more robust results and reducing the inevitable biases present in any single dataset.",
}

@inproceedings{marelli-etal-2014-sick,
    title = "A {SICK} cure for the evaluation of compositional distributional semantic models",
    author = "Marelli, Marco and Menini, Stefano and Baroni, Marco and Bentivogli, Luisa and Bernardi, Raffaella and Zamparelli, Roberto",
    editor = "Calzolari, Nicoletta and Choukri, Khalid and Declerck, Thierry and Loftsson, Hrafn and Maegaard, Bente and Mariani, Joseph and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios",
    booktitle = "Proceedings of the Ninth International Conference on Language Resources and Evaluation ({LREC}14)",
    month = may,
    year = "2014",
    address = "Reykjavik, Iceland",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2014/pdf/363_Paper.pdf",
    pages = "216--223",
    abstract = "Shared and internationally recognized benchmarks are fundamental for the development of any computational system. We aim to help the research community working on compositional distributional semantic models (CDSMs) by providing SICK (Sentences Involving Compositional Knowledge), a large size English benchmark tailored for them. SICK consists of about 10,000 English sentence pairs that include many examples of the lexical, syntactic and semantic phenomena that CDSMs are expected to account for, but do not require dealing with other aspects of existing sentential data sets (idiomatic multiword expressions, named entities, telegraphic language) that are not within the scope of CDSMs. By means of crowdsourcing techniques, each pair was annotated for two crucial semantic tasks: relatedness in meaning (with a 5-point rating scale as gold score) and entailment relation between the two elements (with three possible gold labels: entailment, contradiction, and neutral). The SICK data set was used in SemEval-2014 Task 1, and it is freely available for research purposes.",
}

ACL:

Maximos Skandalis, Richard Moot, Christian Retoré, and Simon Robillard. 2024. New Datasets for Automatic Detection of Textual Entailment and of Contradictions between Sentences in French. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12173–12186, Torino, Italy. ELRA and ICCL.

And

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC14), pages 216–223, Reykjavik, Iceland. European Language Resources Association (ELRA).
