maximoss/lingnli-multi-mt
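Dataset Description
Dataset Summary
This dataset contains machine-translated versions of the LingNLI dataset in 9 languages (Bulgarian, Finnish, French, Greek, Italian, Korean, Lithuanian, Portuguese, Spanish). The goal is to predict textual entailment (whether sentence A entails, contradicts, or neither entails nor contradicts sentence B), which is a classification task: given two sentences, predict one of three labels. The dataset follows the same format as the widely used XNLI dataset, for ease of use.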
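A minimal sketch of how the data might be loaded and narrowed to one language. It assumes the Hugging Face `datasets` library is installed, that the repository ID at the top of this card resolves with its default configuration, and that the splits are named `train` and `validation`; none of this is confirmed by the card itself. The helper names `select_language` and `load_split` are hypothetical, introduced only for illustration.

```python
from typing import Any, Dict, Iterable, List

REPO_ID = "maximoss/lingnli-multi-mt"  # repository ID from the top of this card


def select_language(rows: Iterable[Dict[str, Any]], language: str) -> List[Dict[str, Any]]:
    """Keep only the rows whose `language` field equals the given code."""
    return [row for row in rows if row["language"] == language]


def load_split(split: str = "train"):
    """Load one split from the Hub (requires network access).

    Assumption: the default configuration and the split names
    "train" / "validation" are valid for this repository.
    """
    from datasets import load_dataset  # imported lazily: optional dependency

    return load_dataset(REPO_ID, split=split)


# Offline demonstration on hand-written rows (NOT taken from the dataset).
toy_rows = [
    {"language": "fr", "premise": "Il pleut.", "hypothesis": "Il fait beau.", "label": 2},
    {"language": "es", "premise": "Llueve.", "hypothesis": "Hace sol.", "label": 2},
]
french_rows = select_language(toy_rows, "fr")
print(len(french_rows))
```

Because `select_language` only needs an iterable of dictionaries, it can be applied to the rows of a loaded split as well as to plain Python lists.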
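Supported Tasks and Leaderboards
This dataset can be used for the natural language inference (NLI) task, also known as recognizing textual entailment (RTE), which is a sentence-pair classification task.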
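Dataset Structure
Data Fields
- `language`: The language of the sentence pair.
- `premise`: The machine-translated premise in the target language.
- `hypothesis`: The machine-translated hypothesis in the target language.
- `label`: The classification label, with possible values 0 (`entailment`), 1 (`neutral`), 2 (`contradiction`).
- `label_text`: The classification label, with possible values `entailment` (0), `neutral` (1), `contradiction` (2).
- `premise_original`: The original premise from the English source dataset.
- `hypothesis_original`: The original hypothesis from the English source dataset.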
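The correspondence between `label` and `label_text` described above can be sketched as a pair of lookup tables. The names `ID2LABEL`, `LABEL2ID`, and `label_is_consistent` are hypothetical and not part of the dataset; the example row is hand-written for illustration.

```python
from typing import Any, Mapping

# Integer label -> label name, as listed in the field descriptions.
ID2LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}
# Inverse mapping: label name -> integer label.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def label_is_consistent(row: Mapping[str, Any]) -> bool:
    """Return True when the row's `label` and `label_text` fields agree."""
    return ID2LABEL.get(row["label"]) == row["label_text"]


# Hand-written example row (not taken from the dataset).
example = {"label": 1, "label_text": "neutral"}
print(label_is_consistent(example))
```

A check like this is a cheap sanity test to run once over a loaded split before training a classifier on the integer labels.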
Data Splits
Whole dataset (LitL and LotS subsets):
| Language | Train | Validation |
|---|---|---|
| all_languages | 269865 | 44037 |
| el-gr | 29985 | 4893 |
| fr | 29985 | 4893 |
| it | 29985 | 4893 |
| es | 29985 | 4893 |
| pt | 29985 | 4893 |
| ko | 29985 | 4893 |
| fi | 29985 | 4893 |
| lt | 29985 | 4893 |
| bg | 29985 | 4893 |
LitL subset:
| Language | Train | Validation |
|---|---|---|
| all_languages | 134955 | 21825 |
| el-gr | 14995 | 2425 |
| fr | 14995 | 2425 |
| it | 14995 | 2425 |
| es | 14995 | 2425 |
| pt | 14995 | 2425 |
| ko | 14995 | 2425 |
| fi | 14995 | 2425 |
| lt | 14995 | 2425 |
| bg | 14995 | 2425 |
LotS subset:
| Language | Train | Validation |
|---|---|---|
| all_languages | 134910 | 22212 |
| el-gr | 14990 | 2468 |
| fr | 14990 | 2468 |
| it | 14990 | 2468 |
| es | 14990 | 2468 |
| pt | 14990 | 2468 |
| ko | 14990 | 2468 |
| fi | 14990 | 2468 |
| lt | 14990 | 2468 |
| bg | 14990 | 2468 |
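The three tables are related by simple arithmetic: each per-language count in the whole dataset is the sum of the LitL and LotS counts, and each `all_languages` row is the per-language count times the 9 languages. A short sketch that recomputes these totals from the per-language figures in the tables (the names `PER_LANGUAGE`, `combined`, and `all_languages` are hypothetical):

```python
# Per-language sizes copied from the LitL and LotS tables.
PER_LANGUAGE = {
    "LitL": {"train": 14995, "validation": 2425},
    "LotS": {"train": 14990, "validation": 2468},
}
NUM_LANGUAGES = 9  # bg, el-gr, es, fi, fr, it, ko, lt, pt


def combined(split: str) -> int:
    """Per-language size of the whole dataset (LitL + LotS) for one split."""
    return sum(sizes[split] for sizes in PER_LANGUAGE.values())


def all_languages(per_language_count: int) -> int:
    """Size of an `all_languages` row given the per-language count."""
    return per_language_count * NUM_LANGUAGES


for split in ("train", "validation"):
    print(split, combined(split), all_languages(combined(split)))
# train 29985 269865
# validation 4893 44037
```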
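Dataset Creation
The two subsets of the original dataset were machine translated using the latest neural machine translation opus-mt-tc-big models available at the time. The translation work ran from March 25, 2023 to April 8, 2023.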
Additional Information
Citation Information
BibTeX:
@inproceedings{skandalis-etal-2024-new-datasets,
    title = "New Datasets for Automatic Detection of Textual Entailment and of Contradictions between Sentences in {F}rench",
    author = "Skandalis, Maximos and Moot, Richard and Retor{\'e}, Christian and Robillard, Simon",
    editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1065",
    pages = "12173--12186",
    abstract = "This paper introduces DACCORD, an original dataset in French for automatic detection of contradictions between sentences. It also presents new, manually translated versions of two datasets, namely the well known dataset RTE3 and the recent dataset GQNLI, from English to French, for the task of natural language inference / recognising textual entailment, which is a sentence-pair classification task. These datasets help increase the admittedly limited number of datasets in French available for these tasks. DACCORD consists of 1034 pairs of sentences and is the first dataset exclusively dedicated to this task and covering among others the topic of the Russian invasion in Ukraine. RTE3-FR contains 800 examples for each of its validation and test subsets, while GQNLI-FR is composed of 300 pairs of sentences and focuses specifically on the use of generalised quantifiers. Our experiments on these datasets show that they are more challenging than the two already existing datasets for the mainstream NLI task in French (XNLI, FraCaS). For languages other than English, most deep learning models for NLI tasks currently have only XNLI available as a training set. Additional datasets, such as ours for French, could permit different training and evaluation strategies, producing more robust results and reducing the inevitable biases present in any single dataset.",
}

@inproceedings{parrish-etal-2021-putting-linguist,
    title = "Does Putting a Linguist in the Loop Improve {NLU} Data Collection?",
    author = "Parrish, Alicia and Huang, William and Agha, Omar and Lee, Soo-Hwan and Nangia, Nikita and Warstadt, Alexia and Aggarwal, Karmanya and Allaway, Emily and Linzen, Tal and Bowman, Samuel R.",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.421",
    doi = "10.18653/v1/2021.findings-emnlp.421",
    pages = "4886--4901",
    abstract = "Many crowdsourced NLP datasets contain systematic artifacts that are identified only after data collection is complete. Earlier identification of these issues should make it easier to create high-quality training and evaluation data. We attempt this by evaluating protocols in which expert linguists work {`}in the loop{'} during data collection to identify and address these issues by adjusting task instructions and incentives. Using natural language inference as a test case, we compare three data collection protocols: (i) a baseline protocol with no linguist involvement, (ii) a linguist-in-the-loop intervention with iteratively-updated constraints on the writing task, and (iii) an extension that adds direct interaction between linguists and crowdworkers via a chatroom. We find that linguist involvement does not lead to increased accuracy on out-of-domain test sets compared to baseline, and adding a chatroom has no effect on the data. Linguist involvement does, however, lead to more challenging evaluation data and higher accuracy on some challenge sets, demonstrating the benefits of integrating expert analysis during data collection.",
}
ACL:
Maximos Skandalis, Richard Moot, Christian Retoré, and Simon Robillard. 2024. New Datasets for Automatic Detection of Textual Entailment and of Contradictions between Sentences in French. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12173–12186, Torino, Italy. ELRA and ICCL.
And
Alicia Parrish, William Huang, Omar Agha, Soo-Hwan Lee, Nikita Nangia, Alexia Warstadt, Karmanya Aggarwal, Emily Allaway, Tal Linzen, and Samuel R. Bowman. 2021. Does Putting a Linguist in the Loop Improve NLU Data Collection?. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4886–4901, Punta Cana, Dominican Republic. Association for Computational Linguistics.




