emrecan/nli_tr_for_simcse

Name: emrecan/nli_tr_for_simcse
Creator: emrecan
Published: 2023-01-25 16:56:04
License: no description available

Source: Hugging Face. Updated 2023-01-25; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/emrecan/nli_tr_for_simcse

Resource summary:

This dataset is a modified version of the NLI-TR dataset, intended for training Supervised SimCSE models that produce sentence embeddings. It was generated by merging the training splits of the snli_tr and multinli_tr subsets, finding every premise that has both an entailment hypothesis and a contradiction hypothesis, and writing the resulting triples in the format sent0 (premise), sent1 (entailment hypothesis), hard_neg (contradiction hypothesis). The dataset is tagged for semantic similarity scoring and text scoring tasks on Turkish sentences.

Provider:

emrecan

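Summary of original information

Dataset overview

Basic information

Language: Turkish (tr)
Size: 100,000 to 1,000,000 records
Source dataset: NLI-TR
Task category: text classification
Task IDs:
- semantic similarity scoring
- text scoring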

Dataset purpose

Used to train supervised SimCSE models that generate sentence embeddings.
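
To illustrate how a (sent0, sent1, hard_neg) triple is typically consumed during supervised SimCSE training, here is a minimal pure-Python sketch of a contrastive loss with in-batch negatives and hard negatives. It is not taken from this page: the function names, the toy 2-dimensional embeddings, and the temperature of 0.05 are assumptions made for illustration, and a real training run would use a neural encoder and a tensor library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_simcse_loss(anchors, positives, hard_negatives, temperature=0.05):
    """Mean contrastive loss over a batch of (sent0, sent1, hard_neg) embeddings.

    temperature=0.05 is an assumed default, not a value stated on this page.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Candidates for anchor i: every sent1 and every hard_neg in the batch.
        logits = [cosine(anchor, p) / temperature for p in positives]
        logits += [cosine(anchor, n) / temperature for n in hard_negatives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the anchor's own sent1 (index i) as the target.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings (hypothetical): sent1 lies near sent0, hard_neg points away.
sent0_emb = [[1.0, 0.0], [0.0, 1.0]]
sent1_emb = [[1.0, 0.1], [0.1, 1.0]]
hard_neg_emb = [[-1.0, 0.0], [0.0, -1.0]]

good_loss = supervised_simcse_loss(sent0_emb, sent1_emb, hard_neg_emb)
# Swapping positives and hard negatives should give a much larger loss.
bad_loss = supervised_simcse_loss(sent0_emb, hard_neg_emb, sent1_emb)
```

The contradiction hypothesis acts as a hard negative because it shares its premise's topic and vocabulary while differing in meaning, which makes it harder to separate than a random sentence from the batch.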

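Data processing steps

Merge the training splits of the snli_tr and multinli_tr subsets.
Find every premise that has both an entailment hypothesis and a contradiction hypothesis.
Write the resulting triples in the format sent0 (premise), sent1 (entailment hypothesis), hard_neg (contradiction hypothesis).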
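
The steps above can be sketched in plain Python as follows. This is a hedged illustration, not the creator's script: the row field names (premise, hypothesis, label), the label ids (0 = entailment, 1 = neutral, 2 = contradiction, the usual SNLI/MultiNLI convention), and the choice to keep only the first matching hypothesis of each kind are all assumptions, and the in-memory rows stand in for the real snli_tr and multinli_tr training splits.

```python
from collections import defaultdict

# Assumed label ids; not confirmed by this page.
ENTAILMENT, NEUTRAL, CONTRADICTION = 0, 1, 2


def build_triples(rows):
    """Group hypotheses by premise and emit sent0/sent1/hard_neg triples."""
    by_premise = defaultdict(lambda: {ENTAILMENT: [], CONTRADICTION: []})
    for row in rows:
        label = row["label"]
        if label in (ENTAILMENT, CONTRADICTION):
            by_premise[row["premise"]][label].append(row["hypothesis"])

    triples = []
    for premise, hyps in by_premise.items():
        # Keep only premises that have both kinds of hypothesis.
        if hyps[ENTAILMENT] and hyps[CONTRADICTION]:
            triples.append({
                "sent0": premise,
                # Assumption: use the first hypothesis of each kind.
                "sent1": hyps[ENTAILMENT][0],
                "hard_neg": hyps[CONTRADICTION][0],
            })
    return triples


# Hypothetical stand-ins for the two training splits.
snli_rows = [
    {"premise": "Kedi evde.", "hypothesis": "Evde bir hayvan var.", "label": ENTAILMENT},
    {"premise": "Adam parkta.", "hypothesis": "Bir adam var.", "label": ENTAILMENT},
]
multinli_rows = [
    {"premise": "Kedi evde.", "hypothesis": "Evde kedi yok.", "label": CONTRADICTION},
    {"premise": "Kedi evde.", "hypothesis": "Kedi uyuyor.", "label": NEUTRAL},
]

# Step 1: merge the splits. Steps 2 and 3: find and write the triples.
triples = build_triples(snli_rows + multinli_rows)
```

In this toy input only "Kedi evde." has both an entailment and a contradiction hypothesis, so it is the only premise that yields a triple; "Adam parkta." is dropped.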
