emrecan/all-nli-tr
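Dataset Overview
Basic Information
- Language: Turkish
- License:
  - CC-BY-3.0
  - CC-BY-4.0
  - CC-BY-SA-3.0
  - MIT
  - Other
- Multilinguality: monolingual
- Dataset size: 1M < n < 10M
- Task categories:
  - Feature extraction
  - Sentence similarity
- Dataset name: AllNLITR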
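Dataset Configurations
Configuration pair
- Features:
  - anchor: string
  - positive: string
- Splits:
  - train: 313601 examples
  - dev: 6802 examples
  - test: 6827 examples
Configuration pair-class
- Features:
  - premise: string
  - hypothesis: string
  - label: class label (0: entailment, 1: neutral, 2: contradiction)
- Splits:
  - train: 941086 examples
  - dev: 19649 examples
  - test: 19652 examples
Configuration pair-score
- Features:
  - sentence1: string
  - sentence2: string
  - score: float
- Splits:
  - train: 941086 examples
  - dev: 19649 examples
  - test: 19652 examples
Configuration triplet
- Features:
  - anchor: string
  - positive: string
  - negative: string
- Splits:
  - train: 482091 examples
  - dev: 6567 examples
  - test: 6587 examples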
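The configurations above can be loaded one at a time. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the dataset is reachable on the Hub under the repository id shown at the top of this card. `CONFIG_COLUMNS` and `load_config` are illustrative names introduced here, not part of the dataset.

```python
# Columns of each configuration, as listed in this card.
CONFIG_COLUMNS = {
    "pair": ["anchor", "positive"],
    "pair-class": ["premise", "hypothesis", "label"],
    "pair-score": ["sentence1", "sentence2", "score"],
    "triplet": ["anchor", "positive", "negative"],
}


def load_config(name, split="train"):
    """Load one configuration of emrecan/all-nli-tr (hypothetical helper)."""
    if name not in CONFIG_COLUMNS:
        raise ValueError(f"unknown configuration: {name!r}")
    # Imported lazily so the table above is usable without the library.
    from datasets import load_dataset

    # Split names follow this card: "train", "dev", "test".
    return load_dataset("emrecan/all-nli-tr", name, split=split)
```

Under these assumptions, `load_config("triplet", split="dev")` should return a dataset whose columns match the `triplet` entry of the table.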
Dataset Subsets
pair-class subset
- Columns: "premise", "hypothesis", "label"
- Column types: str, str, class with {"0": "entailment", "1": "neutral", "2": "contradiction"}
- Example:
    {
      'premise': 'A person on a horse jumps over a broken down airplane.',
      'hypothesis': 'A person is training his horse for a competition.',
      'label': 1,
    }
- Collection strategy: Reading the premise, hypothesis, and integer label from the SNLI & MultiNLI datasets.
- Deduplicated: yes
pair-score subset
- Columns: "sentence1", "sentence2", "score"
- Column types: str, str, float
- Example:
    {
      'sentence1': 'A person on a horse jumps over a broken down airplane.',
      'sentence2': 'A person is training his horse for a competition.',
      'score': 0.5,
    }
- Collection strategy: Taken from the pair-class subset, mapping "entailment", "neutral", and "contradiction" to 1.0, 0.5, and 0.0, respectively.
- Deduplicated: yes
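The label-to-score mapping described for pair-score can be sketched as follows. This is an illustration of the stated mapping only, assuming rows are plain dictionaries with the pair-class columns; `to_pair_score` is a hypothetical helper, not the script used to build the dataset.

```python
# Integer label ids and names, as listed for the pair-class subset.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Score assigned to each label name, as described for pair-score.
LABEL_SCORES = {"entailment": 1.0, "neutral": 0.5, "contradiction": 0.0}


def to_pair_score(row):
    """Convert one pair-class row (a dict) into a pair-score row."""
    label_name = LABEL_NAMES[row["label"]]
    return {
        "sentence1": row["premise"],
        "sentence2": row["hypothesis"],
        "score": LABEL_SCORES[label_name],
    }


example = {
    "premise": "A person on a horse jumps over a broken down airplane.",
    "hypothesis": "A person is training his horse for a competition.",
    "label": 1,
}
print(to_pair_score(example)["score"])  # 0.5
```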
pair subset
- Columns: "anchor", "positive"
- Column types: str, str
- Example:
    {
      'anchor': 'A person on a horse jumps over a broken down airplane.',
      'positive': 'A person is training his horse for a competition.',
    }
- Collection strategy: Reading the SNLI & MultiNLI datasets and, where the label is "entailment", taking the "premise" as the "anchor" and the "hypothesis" as the "positive". The reverse direction (the entailed "hypothesis" as "anchor" and the "premise" as "positive") is not included.
- Deduplicated: yes
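The pair collection strategy can be sketched as a filter over pair-class rows. This is a minimal illustration under two assumptions: rows are plain dictionaries, and deduplication means dropping exact repeats of the same (anchor, positive) pair, which the card does not spell out. `build_pairs` is a hypothetical name.

```python
ENTAILMENT = 0  # label id for "entailment" in the pair-class subset


def build_pairs(rows):
    """Keep entailment rows as (anchor, positive) pairs, dropping exact duplicates."""
    seen = set()
    pairs = []
    for row in rows:
        if row["label"] != ENTAILMENT:
            continue  # neutral and contradiction rows yield no pair
        key = (row["premise"], row["hypothesis"])
        if key in seen:
            continue
        seen.add(key)
        # Only premise -> hypothesis is emitted; the reverse is not.
        pairs.append({"anchor": key[0], "positive": key[1]})
    return pairs


rows = [
    {"premise": "p1", "hypothesis": "h1", "label": 0},
    {"premise": "p1", "hypothesis": "h2", "label": 1},
    {"premise": "p1", "hypothesis": "h1", "label": 0},  # exact duplicate
    {"premise": "p2", "hypothesis": "h3", "label": 2},
]
print(build_pairs(rows))  # [{'anchor': 'p1', 'positive': 'h1'}]
```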
triplet subset
- Columns: "anchor", "positive", "negative"
- Column types: str, str, str
- Example:
    {
      'anchor': 'A person on a horse jumps over a broken down airplane.',
      'positive': 'A person is outdoors, on a horse.',
      'negative': 'A person is at a diner, ordering an omelette.',
    }
- Collection strategy: Reading the SNLI & MultiNLI datasets, building for each "premise" a list of entailing sentences and a list of contradictory sentences, then generating every possible triplet from those lists. The reverse direction (an entailed "hypothesis" as "anchor" and the "premise" as "positive") is not included.
- Deduplicated: yes
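The triplet strategy amounts to a per-premise Cartesian product of entailing and contradictory sentences. The sketch below assumes rows are plain dictionaries with the pair-class columns and that deduplication drops exact repeats; `build_triplets` is a hypothetical helper, not the original build script.

```python
from collections import defaultdict
from itertools import product

ENTAILMENT, CONTRADICTION = 0, 2  # label ids from the pair-class subset


def build_triplets(rows):
    """Generate all (anchor, positive, negative) triplets per premise."""
    entailed = defaultdict(list)
    contradicted = defaultdict(list)
    for row in rows:
        if row["label"] == ENTAILMENT:
            entailed[row["premise"]].append(row["hypothesis"])
        elif row["label"] == CONTRADICTION:
            contradicted[row["premise"]].append(row["hypothesis"])

    seen = set()
    triplets = []
    for premise, positives in entailed.items():
        # A premise with no contradictory sentence produces no triplet.
        negatives = contradicted.get(premise, [])
        for positive, negative in product(positives, negatives):
            key = (premise, positive, negative)
            if key not in seen:
                seen.add(key)
                triplets.append(
                    {"anchor": premise, "positive": positive, "negative": negative}
                )
    return triplets


premise = "A person on a horse jumps over a broken down airplane."
rows = [
    {"premise": premise, "hypothesis": "A person is outdoors, on a horse.", "label": 0},
    {"premise": premise, "hypothesis": "A person is training his horse for a competition.", "label": 1},
    {"premise": premise, "hypothesis": "A person is at a diner, ordering an omelette.", "label": 2},
]
print(len(build_triplets(rows)))  # 1
```

With m entailing and n contradictory sentences for a premise, this yields up to m × n triplets for that premise, which is one way a triplet configuration can end up with more rows than the pair configuration.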
Citation
@inproceedings{budur-etal-2020-data,
  title = "Data and Representation for Turkish Natural Language Inference",
  author = "Budur, Emrah and {\"O}zçelik, Rıza and G{\"u}ng{\"o}r, Tunga",
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  abstract = "Large annotated datasets in NLP are overwhelmingly in English. This is an obstacle to progress in other languages. Unfortunately, obtaining new annotated resources for each task in each language would be prohibitively expensive. At the same time, commercial machine translation systems are now robust. Can we leverage these systems to translate English-language datasets automatically? In this paper, we offer a positive response for natural language inference (NLI) in Turkish. We translated two large English NLI datasets into Turkish and had a team of experts validate their translation quality and fidelity to the original labels. Using these datasets, we address core issues of representation for Turkish NLI. We find that in-language embeddings are essential and that morphological parsing can be avoided where the training set is large. Finally, we show that models trained on our machine-translated datasets are successful on human-translated evaluation sets. We share all code, models, and data publicly.",
}




