
emrecan/all-nli-tr

Source: Hugging Face. Updated: 2024-06-16. Indexed: 2024-06-29.
Download link:
https://hf-mirror.com/datasets/emrecan/all-nli-tr
Summary:

This dataset is a formatted version of the NLI-TR datasets and shares their licenses. The format is intended to match AllNLI by Sentence Transformers for ease of training. Although originally intended for Natural Language Inference (NLI), the dataset can also be used to train or fine-tune an embedding model for semantic textual similarity. It contains four subsets (pair-class, pair-score, pair, and triplet), each with its own columns and data types, and all of them have been deduplicated.
Provider:
emrecan
Original information summary

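Dataset overview

Basic information

  • Language: Turkish
  • Licenses:
    • CC-BY-3.0
    • CC-BY-4.0
    • CC-BY-SA-3.0
    • MIT
    • Other
  • Multilinguality: monolingual
  • Dataset size: 1M < n < 10M
  • Task categories:
    • Feature extraction
    • Sentence similarity
  • Dataset name: AllNLITR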

Dataset configurations

Configuration pair

  • Features:
    • anchor: string
    • positive: string
  • Splits:
    • train: 313601 examples
    • dev: 6802 examples
    • test: 6827 examples

Configuration pair-class

  • Features:
    • premise: string
    • hypothesis: string
    • label: class label
      • 0: entailment
      • 1: neutral
      • 2: contradiction
  • Splits:
    • train: 941086 examples
    • dev: 19649 examples
    • test: 19652 examples

Configuration pair-score

  • Features:
    • sentence1: string
    • sentence2: string
    • score: float
  • Splits:
    • train: 941086 examples
    • dev: 19649 examples
    • test: 19652 examples

Configuration triplet

  • Features:
    • anchor: string
    • positive: string
    • negative: string
  • Splits:
    • train: 482091 examples
    • dev: 6567 examples
    • test: 6587 examples
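
The configurations listed above can be loaded one at a time. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the Hub repository exposes the configuration and split names exactly as listed here. The helper name `load_subset` and the `EXPECTED_COLUMNS` table are illustrative and not part of the dataset.

```python
# Column layout of each configuration, copied from the listing above.
EXPECTED_COLUMNS = {
    "pair": ["anchor", "positive"],
    "pair-class": ["premise", "hypothesis", "label"],
    "pair-score": ["sentence1", "sentence2", "score"],
    "triplet": ["anchor", "positive", "negative"],
}


def load_subset(config: str, split: str = "train"):
    """Load one configuration and split of emrecan/all-nli-tr from the Hub.

    `load_subset` is a hypothetical helper; the split names ("train",
    "dev", "test") are assumed to match the listing above.
    """
    if config not in EXPECTED_COLUMNS:
        raise ValueError(f"unknown configuration: {config!r}")
    # Imported lazily so the table above can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset("emrecan/all-nli-tr", config, split=split)
```

If the listing is accurate, `load_subset("triplet", "dev").column_names` should match `EXPECTED_COLUMNS["triplet"]`; a mismatch would suggest the Hub layout has changed since this page was indexed.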

Dataset subsets

pair-class subset

  • Columns: "premise", "hypothesis", "label"

  • Column types: str, str, class

    • {"0": "entailment", "1": "neutral", "2": "contradiction"}
  • Example:
      {
        "premise": "A person on a horse jumps over a broken down airplane.",
        "hypothesis": "A person is training his horse for a competition.",
        "label": 1
      }

  • Collection strategy: read the premise, hypothesis, and integer label from the SNLI & MultiNLI datasets.

  • Deduplicated: yes
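
The page does not say how deduplication was performed. As a sketch under the assumption that it means dropping exact duplicate rows, an order-preserving pass over (premise, hypothesis, label) tuples could look like this; the function name `deduplicate` is illustrative.

```python
def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence in order.

    Assumption: two rows are duplicates when premise, hypothesis, and
    label are all identical. The dataset's actual rule is not documented
    on this page.
    """
    seen = set()
    unique = []
    for row in rows:
        key = (row["premise"], row["hypothesis"], row["label"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"premise": "p1", "hypothesis": "h1", "label": 1},
    {"premise": "p1", "hypothesis": "h1", "label": 1},  # exact duplicate
    {"premise": "p1", "hypothesis": "h2", "label": 0},
]
unique_rows = deduplicate(rows)
```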

pair-score subset

  • Columns: "sentence1", "sentence2", "score"

  • Column types: str, str, float

  • Example:
      {
        "sentence1": "A person on a horse jumps over a broken down airplane.",
        "sentence2": "A person is training his horse for a competition.",
        "score": 0.5
      }

  • Collection strategy: taken from the pair-class subset, mapping "entailment", "neutral", and "contradiction" to 1.0, 0.5, and 0.0, respectively.

  • Deduplicated: yes
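
The label-to-score mapping described above can be sketched as a small conversion from a pair-class row to a pair-score row. The function name `to_pair_score` is illustrative; the label ids and score values are the ones given on this page.

```python
# Integer label ids as listed for the pair-class subset.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}
# Score assigned to each label name in the pair-score subset.
LABEL_TO_SCORE = {"entailment": 1.0, "neutral": 0.5, "contradiction": 0.0}


def to_pair_score(row):
    """Convert one pair-class row into the pair-score column layout."""
    label_name = LABEL_NAMES[row["label"]]
    return {
        "sentence1": row["premise"],
        "sentence2": row["hypothesis"],
        "score": LABEL_TO_SCORE[label_name],
    }


example = to_pair_score({
    "premise": "A person on a horse jumps over a broken down airplane.",
    "hypothesis": "A person is training his horse for a competition.",
    "label": 1,
})
```

With the pair-class example row (label 1, neutral) as input, the result carries a score of 0.5, which matches the pair-score example above.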

pair subset

  • Columns: "anchor", "positive"

  • Column types: str, str

  • Example:
      {
        "anchor": "A person on a horse jumps over a broken down airplane.",
        "positive": "A person is training his horse for a competition."
      }

  • Collection strategy: read from the SNLI & MultiNLI datasets, taking the "premise" as the "anchor" and the "hypothesis" as the "positive" when the label is "entailment". The reverse direction (the "hypothesis" as "anchor" and the "premise" as "positive") is not included.

  • Deduplicated: yes
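
The pair collection strategy above can be sketched as a filter over pair-class style rows. This is an illustration rather than the author's script: the function name `build_pairs` is hypothetical, and exact-match deduplication is an assumption.

```python
ENTAILMENT = 0  # label id for "entailment" in the pair-class subset


def build_pairs(rows):
    """Keep entailment rows as (anchor, positive) pairs, premise first.

    Only the premise -> hypothesis direction is emitted; the reverse
    direction is deliberately left out, as the strategy describes.
    """
    seen = set()
    pairs = []
    for row in rows:
        if row["label"] != ENTAILMENT:
            continue
        key = (row["premise"], row["hypothesis"])
        if key not in seen:  # assumed exact-match deduplication
            seen.add(key)
            pairs.append({"anchor": key[0], "positive": key[1]})
    return pairs


rows = [
    {"premise": "p1", "hypothesis": "h_entail", "label": 0},
    {"premise": "p1", "hypothesis": "h_neutral", "label": 1},
    {"premise": "p1", "hypothesis": "h_entail", "label": 0},  # duplicate
    {"premise": "p2", "hypothesis": "h_contra", "label": 2},
]
pairs = build_pairs(rows)
```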

triplet subset

  • Columns: "anchor", "positive", "negative"

  • Column types: str, str, str

  • Example:
      {
        "anchor": "A person on a horse jumps over a broken down airplane.",
        "positive": "A person is outdoors, on a horse.",
        "negative": "A person is at a diner, ordering an omelette."
      }

  • Collection strategy: read from the SNLI & MultiNLI datasets, build for each "premise" a list of entailing sentences and a list of contradictory sentences, then generate every possible triplet from those lists. The reverse direction (the "hypothesis" as "anchor" and the "premise" as "positive") is not included.

  • Deduplicated: yes
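
The triplet strategy above amounts to a per-premise Cartesian product of entailing and contradictory hypotheses. A minimal sketch follows, assuming pair-class style input rows and exact-match deduplication; the function name `build_triplets` is illustrative and not taken from the dataset's tooling.

```python
from collections import defaultdict
from itertools import product

ENTAILMENT, CONTRADICTION = 0, 2  # label ids from the pair-class subset


def build_triplets(rows):
    """Generate every (anchor, positive, negative) triplet per premise."""
    entailing = defaultdict(list)
    contradicting = defaultdict(list)
    for row in rows:
        if row["label"] == ENTAILMENT:
            entailing[row["premise"]].append(row["hypothesis"])
        elif row["label"] == CONTRADICTION:
            contradicting[row["premise"]].append(row["hypothesis"])

    seen = set()
    triplets = []
    for premise, positives in entailing.items():
        negatives = contradicting.get(premise, [])
        # A premise with no contradictory sentence yields no triplets.
        for positive, negative in product(positives, negatives):
            key = (premise, positive, negative)
            if key not in seen:  # assumed exact-match deduplication
                seen.add(key)
                triplets.append(
                    {"anchor": premise, "positive": positive, "negative": negative}
                )
    return triplets


rows = [
    {"premise": "p1", "hypothesis": "e1", "label": 0},
    {"premise": "p1", "hypothesis": "e2", "label": 0},
    {"premise": "p1", "hypothesis": "c1", "label": 2},
    {"premise": "p1", "hypothesis": "n1", "label": 1},  # neutral rows are ignored
    {"premise": "p2", "hypothesis": "e3", "label": 0},  # no contradiction for p2
]
triplets = build_triplets(rows)
```

Because every entailing sentence is combined with every contradictory one, a premise with m entailing and n contradictory hypotheses contributes up to m × n triplets, which may explain why the triplet train split is larger than the pair train split.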

Citation information

@inproceedings{budur-etal-2020-data,
    title = "Data and Representation for Turkish Natural Language Inference",
    author = "Budur, Emrah and \"{O}zçelik, Rıza and G\"{u}ng\"{o}r, Tunga",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    abstract = "Large annotated datasets in NLP are overwhelmingly in English. This is an obstacle to progress in other languages. Unfortunately, obtaining new annotated resources for each task in each language would be prohibitively expensive. At the same time, commercial machine translation systems are now robust. Can we leverage these systems to translate English-language datasets automatically? In this paper, we offer a positive response for natural language inference (NLI) in Turkish. We translated two large English NLI datasets into Turkish and had a team of experts validate their translation quality and fidelity to the original labels. Using these datasets, we address core issues of representation for Turkish NLI. We find that in-language embeddings are essential and that morphological parsing can be avoided where the training set is large. Finally, we show that models trained on our machine-translated datasets are successful on human-translated evaluation sets. We share all code, models, and data publicly.",
}

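Collected summary
Dataset introduction
Background and challenges
Background overview
This is a Turkish natural language inference (NLI) dataset built from SNLI and MultiNLI through machine translation and expert validation. It includes several subsets (such as pair and triplet) that support feature extraction and sentence similarity tasks. The dataset contains between 1M and 10M examples, is distributed in CSV format, and is intended for training or fine-tuning embedding models for semantic textual similarity. Several models have already been trained or fine-tuned on it.
The content above was collected and summarized by 遇见数据集.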