maximoss/lingnli-multi-mt

Name: maximoss/lingnli-multi-mt
Creator: maximoss
Published: 2024-05-18 17:26:34
License: Not specified

Hugging Face | Updated: 2024-05-18 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/maximoss/lingnli-multi-mt

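Resource overview:

This dataset contains machine-translated versions of the LingNLI dataset in 9 languages (Bulgarian, Finnish, French, Greek, Italian, Korean, Lithuanian, Portuguese, Spanish) for the natural language inference task, i.e. deciding whether sentence A entails, contradicts, or is neutral with respect to sentence B. Its structure is identical to that of the widely used XNLI dataset, which makes it easy to use. Users can select the data for a specific language as needed. The translations were produced with the latest neural machine translation models between 25 March and 8 April 2023.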
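
Selecting the rows for one language can be sketched as follows. This is a minimal sketch, assuming each row carries a `language` field whose value is a language code such as `fr`. The helper names `load_rows` and `select_language` are hypothetical. `load_rows` further assumes that the repository loads with the Hugging Face `datasets` library under its default configuration, which the source does not confirm. The two toy rows are invented for illustration and are not taken from the dataset.

```python
from typing import Dict, Iterable, List


def load_rows(split: str = "train"):
    """Fetch one split from the Hugging Face Hub.

    Assumption: the repository loads with its default configuration.
    Requires network access and `pip install datasets`.
    """
    from datasets import load_dataset  # imported lazily so the rest runs offline

    return load_dataset("maximoss/lingnli-multi-mt", split=split)


def select_language(rows: Iterable[Dict], code: str) -> List[Dict]:
    """Keep only the rows whose `language` field equals `code`."""
    return [row for row in rows if row["language"] == code]


# Toy rows invented for illustration; they are NOT taken from the dataset.
toy_rows = [
    {"language": "fr", "premise": "Il pleut.", "hypothesis": "Il fait sec.", "label": 2},
    {"language": "it", "premise": "Piove.", "hypothesis": "È asciutto.", "label": 2},
]

french = select_language(toy_rows, "fr")
print(len(french))
```

With network access, `select_language(load_rows("train"), "fr")` would apply the same filter to the real training split, provided the assumptions above hold.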

Provider:

maximoss

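Summary of original information

Dataset card

Dataset description

Dataset summary

This dataset contains machine-translated versions of the LingNLI dataset in 9 languages (Bulgarian, Finnish, French, Greek, Italian, Korean, Lithuanian, Portuguese, Spanish). The goal is to predict textual entailment (whether sentence A entails, contradicts, or neither entails nor contradicts sentence B), which is a classification task: given two sentences, predict one of three labels. The dataset uses the same format as the widely used XNLI dataset, for ease of use.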

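Supported tasks and leaderboards

This dataset can be used for the natural language inference (NLI) task, also known as recognising textual entailment (RTE), which is a sentence-pair classification task.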
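
Because the task is a three-way classification over sentence pairs, a system can be scored by comparing its predicted labels with the gold labels. The following is a minimal sketch, assuming predictions and gold labels are parallel lists of integer labels. The function name `accuracy` and the example lists are hypothetical. Accuracy is a common choice for NLI, not a metric prescribed by the source.

```python
from typing import Sequence


def accuracy(gold: Sequence[int], predicted: Sequence[int]) -> float:
    """Fraction of sentence pairs whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if len(gold) == 0:
        raise ValueError("cannot score an empty set of examples")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical labels: two of the four predictions match the gold labels.
gold = [0, 1, 2, 2]
predicted = [0, 2, 2, 1]
print(accuracy(gold, predicted))  # 0.5
```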

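Dataset structure

Data fields

language: The language of the sentence pair.
premise: The machine-translated premise in the target language.
hypothesis: The machine-translated hypothesis in the target language.
label: The classification label as an integer; possible values are 0 (entailment), 1 (neutral), 2 (contradiction).
label_text: The classification label as text; possible values are entailment (0), neutral (1), contradiction (2).
premise_original: The original premise from the English source dataset.
hypothesis_original: The original hypothesis from the English source dataset.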
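
The fields above map naturally onto a typed record, and the two label fields can be cross-checked against each other. The following is a sketch only: the names `NLIExample`, `ID_TO_LABEL`, and `is_consistent` are hypothetical, and the example row is invented for illustration rather than taken from the dataset.

```python
from typing import Dict, TypedDict


class NLIExample(TypedDict):
    """One row, with the seven fields listed in the dataset card."""

    language: str
    premise: str
    hypothesis: str
    label: int
    label_text: str
    premise_original: str
    hypothesis_original: str


# Integer-to-text mapping as stated for `label` and `label_text`.
ID_TO_LABEL: Dict[int, str] = {0: "entailment", 1: "neutral", 2: "contradiction"}


def is_consistent(example: NLIExample) -> bool:
    """True when `label_text` is the text form of the integer `label`."""
    return ID_TO_LABEL.get(example["label"]) == example["label_text"]


# Invented row for illustration; NOT taken from the dataset.
example: NLIExample = {
    "language": "fr",
    "premise": "Un chien court dans le parc.",
    "hypothesis": "Un animal est dehors.",
    "label": 0,
    "label_text": "entailment",
    "premise_original": "A dog is running in the park.",
    "hypothesis_original": "An animal is outside.",
}
print(is_consistent(example))
```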

Data splits

Whole dataset (LitL and LotS subsets):

Language	Train	Validation
all_languages	269865	44037
el-gr	29985	4893
fr	29985	4893
it	29985	4893
es	29985	4893
pt	29985	4893
ko	29985	4893
fi	29985	4893
lt	29985	4893
bg	29985	4893

LitL subset:

Language	Train	Validation
all_languages	134955	21825
el-gr	14995	2425
fr	14995	2425
it	14995	2425
es	14995	2425
pt	14995	2425
ko	14995	2425
fi	14995	2425
lt	14995	2425
bg	14995	2425

LotS subset:

Language	Train	Validation
all_languages	134910	22212
el-gr	14990	2468
fr	14990	2468
it	14990	2468
es	14990	2468
pt	14990	2468
ko	14990	2468
fi	14990	2468
lt	14990	2468
bg	14990	2468
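
The three tables are mutually consistent, which a few lines of arithmetic can confirm. The following sketch uses only the counts printed in the tables. The variable names are hypothetical.

```python
NUM_LANGUAGES = 9

# (train, validation) rows for a single language, copied from the tables.
per_language = {
    "LitL": (14995, 2425),
    "LotS": (14990, 2468),
    "full": (29985, 4893),
}

# (train, validation) rows in the all_languages line of each table.
all_languages = {
    "LitL": (134955, 21825),
    "LotS": (134910, 22212),
    "full": (269865, 44037),
}

# Each all_languages row is nine times the per-language row.
for name, (train, validation) in per_language.items():
    assert (train * NUM_LANGUAGES, validation * NUM_LANGUAGES) == all_languages[name]

# The whole dataset is the sum of the LitL and LotS subsets.
combined = tuple(a + b for a, b in zip(per_language["LitL"], per_language["LotS"]))
assert combined == per_language["full"]

total_rows = sum(all_languages["full"])
print(total_rows)  # 313902
```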

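Dataset creation

The two subsets of the original dataset were machine translated with the latest neural machine translation opus-mt-tc-big models. The translation ran from 25 March 2023 to 8 April 2023.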

Additional information

Citation information

BibTeX:

@inproceedings{skandalis-etal-2024-new-datasets,
    title = "New Datasets for Automatic Detection of Textual Entailment and of Contradictions between Sentences in {F}rench",
    author = "Skandalis, Maximos and Moot, Richard and Retor{\'e}, Christian and Robillard, Simon",
    editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1065",
    pages = "12173--12186",
    abstract = "This paper introduces DACCORD, an original dataset in French for automatic detection of contradictions between sentences. It also presents new, manually translated versions of two datasets, namely the well known dataset RTE3 and the recent dataset GQNLI, from English to French, for the task of natural language inference / recognising textual entailment, which is a sentence-pair classification task. These datasets help increase the admittedly limited number of datasets in French available for these tasks. DACCORD consists of 1034 pairs of sentences and is the first dataset exclusively dedicated to this task and covering among others the topic of the Russian invasion in Ukraine. RTE3-FR contains 800 examples for each of its validation and test subsets, while GQNLI-FR is composed of 300 pairs of sentences and focuses specifically on the use of generalised quantifiers. Our experiments on these datasets show that they are more challenging than the two already existing datasets for the mainstream NLI task in French (XNLI, FraCaS). For languages other than English, most deep learning models for NLI tasks currently have only XNLI available as a training set. Additional datasets, such as ours for French, could permit different training and evaluation strategies, producing more robust results and reducing the inevitable biases present in any single dataset.",
}

@inproceedings{parrish-etal-2021-putting-linguist,
    title = "Does Putting a Linguist in the Loop Improve {NLU} Data Collection?",
    author = "Parrish, Alicia and Huang, William and Agha, Omar and Lee, Soo-Hwan and Nangia, Nikita and Warstadt, Alexia and Aggarwal, Karmanya and Allaway, Emily and Linzen, Tal and Bowman, Samuel R.",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.421",
    doi = "10.18653/v1/2021.findings-emnlp.421",
    pages = "4886--4901",
    abstract = "Many crowdsourced NLP datasets contain systematic artifacts that are identified only after data collection is complete. Earlier identification of these issues should make it easier to create high-quality training and evaluation data. We attempt this by evaluating protocols in which expert linguists work {`}in the loop{'} during data collection to identify and address these issues by adjusting task instructions and incentives. Using natural language inference as a test case, we compare three data collection protocols: (i) a baseline protocol with no linguist involvement, (ii) a linguist-in-the-loop intervention with iteratively-updated constraints on the writing task, and (iii) an extension that adds direct interaction between linguists and crowdworkers via a chatroom. We find that linguist involvement does not lead to increased accuracy on out-of-domain test sets compared to baseline, and adding a chatroom has no effect on the data. Linguist involvement does, however, lead to more challenging evaluation data and higher accuracy on some challenge sets, demonstrating the benefits of integrating expert analysis during data collection.",
}

ACL:

Maximos Skandalis, Richard Moot, Christian Retoré, and Simon Robillard. 2024. New Datasets for Automatic Detection of Textual Entailment and of Contradictions between Sentences in French. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12173–12186, Torino, Italy. ELRA and ICCL.

And

Alicia Parrish, William Huang, Omar Agha, Soo-Hwan Lee, Nikita Nangia, Alexia Warstadt, Karmanya Aggarwal, Emily Allaway, Tal Linzen, and Samuel R. Bowman. 2021. Does Putting a Linguist in the Loop Improve NLU Data Collection?. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4886–4901, Punta Cana, Dominican Republic. Association for Computational Linguistics.

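Collected summary

Dataset introduction

Background and challenges

Background overview

This dataset is a multilingual machine-translated version of LingNLI covering 9 languages (including Greek, French, and Italian). It targets the natural language inference task, i.e. three-way classification of sentence pairs as entailment, neutral, or contradiction. It comprises roughly 314,000 rows and provides the translated text alongside the original English, with the aim of supporting cross-lingual NLI research.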

The content above was collected and summarized by 遇见数据集.
