emrecan/all-nli-tr

Name: emrecan/all-nli-tr
Creator: emrecan
Published: 2024-06-16 22:14:24
License: no description provided

Hugging Face · updated 2024-06-16 · indexed 2024-06-29

Download link:

https://hf-mirror.com/datasets/emrecan/all-nli-tr

Resource summary:

This dataset is a formatted version of NLI-TR datasets, sharing the same licenses. The format is intended to be in line with AllNLI by Sentence Transformers for ease of training. Despite originally being intended for Natural Language Inference (NLI), this dataset can be used for training/finetuning an embedding model for semantic textual similarity. The dataset contains four subsets: pair-class, pair-score, pair, and triplet, each with different columns and data types, and all have been deduplicated.

Provider:

emrecan

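Original information summary

Dataset overview

Basic information

Language: Turkish
Licenses:
- CC-BY-3.0
- CC-BY-4.0
- CC-BY-SA-3.0
- MIT
- Other
Multilinguality: monolingual
Dataset size: 1M < n < 10M
Task categories:
- Feature extraction
- Sentence similarity
Dataset name: AllNLITR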

Dataset configurations

Configuration `pair`

Features:
- anchor: string
- positive: string
Splits:
- train: 313601 examples
- dev: 6802 examples
- test: 6827 examples

Configuration `pair-class`

Features:
- premise: string
- hypothesis: string
- label: class label
  - 0: entailment
  - 1: neutral
  - 2: contradiction
Splits:
- train: 941086 examples
- dev: 19649 examples
- test: 19652 examples

Configuration `pair-score`

Features:
- sentence1: string
- sentence2: string
- score: float
Splits:
- train: 941086 examples
- dev: 19649 examples
- test: 19652 examples

Configuration `triplet`

Features:
- anchor: string
- positive: string
- negative: string
Splits:
- train: 482091 examples
- dev: 6567 examples
- test: 6587 examples
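
As a quick reference, the four configurations listed above can be captured in a small lookup table. The sketch below assumes the Hugging Face `datasets` library is installed and that the split names match the ones listed here (`train`, `dev`, `test`). `SUBSET_COLUMNS` and `load_subset` are hypothetical helper names introduced for illustration; they are not part of the dataset.

```python
# Column layout of each configuration, copied from the listing above.
SUBSET_COLUMNS = {
    "pair": ("anchor", "positive"),
    "pair-class": ("premise", "hypothesis", "label"),
    "pair-score": ("sentence1", "sentence2", "score"),
    "triplet": ("anchor", "positive", "negative"),
}


def load_subset(name, split="train"):
    """Load one configuration of emrecan/all-nli-tr (requires network access)."""
    if name not in SUBSET_COLUMNS:
        raise ValueError(f"unknown subset: {name!r}")
    # Imported lazily so the lookup table works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("emrecan/all-nli-tr", name, split=split)
```

Calling `load_subset("triplet", split="dev")` would download the `dev` split of the `triplet` configuration, provided the Hub is reachable.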

Dataset subsets

`pair-class` subset

Columns: "premise", "hypothesis", "label"
Column types: str, str, class
- {"0": "entailment", "1": "neutral", "2": "contradiction"}
Example: `{"premise": "A person on a horse jumps over a broken down airplane.", "hypothesis": "A person is training his horse for a competition.", "label": 1}`
Collection strategy: Read the premise, hypothesis, and integer label from the SNLI & MultiNLI datasets.
Deduplicated: Yes

`pair-score` subset

Columns: "sentence1", "sentence2", "score"
Column types: str, str, float
Example: `{"sentence1": "A person on a horse jumps over a broken down airplane.", "sentence2": "A person is training his horse for a competition.", "score": 0.5}`
Collection strategy: Taken from the pair-class subset, with "entailment", "neutral", and "contradiction" mapped to 1.0, 0.5, and 0.0 respectively.
Deduplicated: Yes
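
The label-to-score mapping described for `pair-score` is simple enough to sketch directly. The snippet below is an illustration only, not the script used to build the dataset; `LABEL_NAMES`, `LABEL_SCORES`, and `to_pair_score` are hypothetical names, and rows are assumed to be plain dicts with the `pair-class` columns.

```python
# Label ids and names as listed for the pair-class subset.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}
# Score assigned to each label name in the pair-score subset.
LABEL_SCORES = {"entailment": 1.0, "neutral": 0.5, "contradiction": 0.0}


def to_pair_score(row):
    """Convert one pair-class row (a dict) into a pair-score row."""
    return {
        "sentence1": row["premise"],
        "sentence2": row["hypothesis"],
        "score": LABEL_SCORES[LABEL_NAMES[row["label"]]],
    }


example = {
    "premise": "A person on a horse jumps over a broken down airplane.",
    "hypothesis": "A person is training his horse for a competition.",
    "label": 1,
}
print(to_pair_score(example)["score"])  # 0.5
```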

`pair` subset

Columns: "anchor", "positive"
Column types: str, str
Example: `{"anchor": "A person on a horse jumps over a broken down airplane.", "positive": "A person is training his horse for a competition."}`
Collection strategy: Read from the SNLI & MultiNLI datasets, using the "premise" as the "anchor" and the "hypothesis" as the "positive" when the label is "entailment". The reverse ("entailment" as "anchor" and "premise" as "positive") is not included.
Deduplicated: Yes
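
The `pair` collection strategy can be sketched as a filter over `pair-class`-style rows. This is a minimal illustration under two assumptions: rows are plain dicts with an integer `label` (0 = entailment), and deduplication means dropping exact repeats of the same (anchor, positive) pair. The page does not say how deduplication was actually performed, and `build_pairs` is a hypothetical helper name.

```python
def build_pairs(rows):
    """Keep entailment rows as (anchor, positive) pairs, dropping exact duplicates."""
    seen = set()
    pairs = []
    for row in rows:
        if row["label"] != 0:  # 0 = entailment; other labels are skipped
            continue
        key = (row["premise"], row["hypothesis"])
        if key not in seen:
            seen.add(key)
            # Premise becomes the anchor; the reverse direction is not emitted.
            pairs.append({"anchor": key[0], "positive": key[1]})
    return pairs
```

Only the premise-to-hypothesis direction is produced, matching the note that the reverse pairing is not included.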

`triplet` subset

Columns: "anchor", "positive", "negative"
Column types: str, str, str
Example: `{"anchor": "A person on a horse jumps over a broken down airplane.", "positive": "A person is outdoors, on a horse.", "negative": "A person is at a diner, ordering an omelette."}`
Collection strategy: Read from the SNLI & MultiNLI datasets, build a list of entailing and a list of contradictory sentences for each "premise", then generate all possible triplets from those lists. The reverse ("entailment" as "anchor" and "premise" as "positive") is not included.
Deduplicated: Yes
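
The triplet strategy amounts to a per-premise Cartesian product of entailing and contradictory hypotheses. The sketch below illustrates that idea under the same assumptions as before (dict rows, integer labels with 0 = entailment and 2 = contradiction, exact-match deduplication). `build_triplets` is a hypothetical helper, not the dataset author's code.

```python
from collections import defaultdict
from itertools import product


def build_triplets(rows):
    """Return sorted, deduplicated (anchor, positive, negative) tuples."""
    entailed = defaultdict(list)      # premise -> entailing hypotheses
    contradicted = defaultdict(list)  # premise -> contradictory hypotheses
    for row in rows:
        if row["label"] == 0:
            entailed[row["premise"]].append(row["hypothesis"])
        elif row["label"] == 2:
            contradicted[row["premise"]].append(row["hypothesis"])

    triplets = set()  # a set drops exact duplicate triplets
    for premise, positives in entailed.items():
        negatives = contradicted.get(premise, [])
        for positive, negative in product(positives, negatives):
            triplets.add((premise, positive, negative))
    return sorted(triplets)
```

A premise with two entailing and three contradictory hypotheses yields six triplets. A premise lacking either kind yields none.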

Citation

@inproceedings{budur-etal-2020-data,
    title = "Data and Representation for Turkish Natural Language Inference",
    author = "Budur, Emrah and {\"O}zçelik, Rıza and G{\"u}ng{\"o}r, Tunga",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    abstract = "Large annotated datasets in NLP are overwhelmingly in English. This is an obstacle to progress in other languages. Unfortunately, obtaining new annotated resources for each task in each language would be prohibitively expensive. At the same time, commercial machine translation systems are now robust. Can we leverage these systems to translate English-language datasets automatically? In this paper, we offer a positive response for natural language inference (NLI) in Turkish. We translated two large English NLI datasets into Turkish and had a team of experts validate their translation quality and fidelity to the original labels. Using these datasets, we address core issues of representation for Turkish NLI. We find that in-language embeddings are essential and that morphological parsing can be avoided where the training set is large. Finally, we show that models trained on our machine-translated datasets are successful on human-translated evaluation sets. We share all code, models, and data publicly.",
}

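Collected summary

Dataset introduction

Background and challenges

Background overview

This is a Turkish natural language inference (NLI) dataset, built from the SNLI and MultiNLI datasets through machine translation and expert validation. It contains several subsets (such as pair and triplet) that support feature extraction and sentence similarity tasks. The dataset holds between 1M and 10M examples and is distributed in CSV format. It is intended for training or fine-tuning embedding models for semantic textual similarity, and multiple models have already been trained or fine-tuned on it.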

The content above was collected and summarized by 遇见数据集.
