agentlans/high-quality-nli

Name: agentlans/high-quality-nli
Creator: agentlans
Published: 2024-11-13 03:44:46
License: not specified

Hugging Face | Updated 2024-11-13 | Indexed 2024-12-14

Download link:

https://hf-mirror.com/datasets/agentlans/high-quality-nli

Description:

This dataset is designed for natural language inference (NLI) and contains high-quality sentence pairs. It aims to improve on commonly used NLI datasets by offering more complex and nuanced examples, which makes it suitable for advanced language understanding models. It contains 688,713 records, split into a training set and a test set.

To build the dataset, sentences were randomly sampled from a collection of high-quality English sentences and from the FineWebEdu-NLI dataset. Hypotheses were generated with either Llama 3 or Flan-T5-base, and the labels were then checked with the cross-encoder/nli-deberta-v3-xsmall model. Each record has three fields: premise, hypothesis, and label. The label takes one of three values: entailment (0), neutral (1), or contradiction (2).

Some limitations and biases remain despite these checks: labels may be incorrect, complex sentences can be ambiguous, the classes are imbalanced, and some sentences are repeated.
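The label encoding and two of the listed limitations (class imbalance and repeated sentences) can be checked with a few lines of Python. The following is a minimal sketch, not part of the dataset's own tooling: the dictionary keys `premise`, `hypothesis`, and `label` are assumed English field names, the helper names `label_distribution` and `repeated_premises` are hypothetical, and `sample_rows` is invented purely for illustration.

```python
from collections import Counter

# Label ids as listed in the description: entailment (0), neutral (1), contradiction (2).
ID_TO_LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}


def label_distribution(rows):
    """Return the fraction of rows carrying each label name."""
    counts = Counter(ID_TO_LABEL[row["label"]] for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Counter returns 0 for label names that never occur.
    return {name: counts[name] / total for name in ID_TO_LABEL.values()}


def repeated_premises(rows):
    """Return premises that occur more than once, with their counts."""
    counts = Counter(row["premise"] for row in rows)
    return {premise: n for premise, n in counts.items() if n > 1}


# Invented rows for illustration only; they are not taken from the dataset.
# The field names are an assumption about how the records are keyed.
sample_rows = [
    {"premise": "The museum opens at nine.",
     "hypothesis": "The museum is open at some point in the morning.",
     "label": 0},
    {"premise": "The museum opens at nine.",
     "hypothesis": "The museum is closed all day.",
     "label": 2},
    {"premise": "She has played the cello since childhood.",
     "hypothesis": "She plays a musical instrument.",
     "label": 0},
    {"premise": "The train left the station late.",
     "hypothesis": "The delay was caused by snow.",
     "label": 1},
]

if __name__ == "__main__":
    print(label_distribution(sample_rows))
    print(repeated_premises(sample_rows))
```

The same two helpers should work on any iterable of row dictionaries, so they could be pointed at the real training or test split once it is loaded, provided the actual field names match the assumed ones.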

Provided by:

agentlans
