
zhili312/Textual-Natural-Contextual-Classification

Source: Hugging Face | Updated: 2023-10-30 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/zhili312/Textual-Natural-Contextual-Classification
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
language:
- en
pretty_name: TNCC
size_categories:
- 1K<n<10K
---

Given the scarcity of datasets for understanding natural language in visual scenes, we introduce a novel textual entailment dataset, named Textual Natural Contextual Classification (TNCC). This dataset is built on Crisscrossed Captions (https://github.com/google-research-datasets/Crisscrossed-Captions), an image captioning dataset that includes human-rated semantic similarity scores on a continuous scale from 0 to 5. We tailor the dataset to suit a binary classification task. Specifically, sentence pairs with annotation scores exceeding 4 are categorized as positive (entailment), whereas pairs with scores below 1 are marked as negative (non-entailment). The TNCC dataset is partitioned into training, validation, and testing sets, containing 3,600, 1,200, and 1,560 instances, respectively.

If you use this dataset for academic research, please cite the NeurIPS 2023 paper titled 'Back-Modality: Leveraging Modal Transformation for Data Augmentation'.
Provided by:
zhili312
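Summary of Original Information

Dataset Overview

Dataset Name

  • Name: TNCC

Dataset Type

  • Type: Textual entailment dataset

Task Category

  • Task: Text classification

Language

  • Language: English

Dataset Size

  • Size: 1K<n<10K

Dataset Description

  • Description: TNCC is built on the Crisscrossed Captions image captioning dataset, which provides human-rated semantic similarity scores on a scale from 0 to 5. TNCC adapts these scores to a binary classification task: sentence pairs scoring above 4 are labeled positive (entailment), and pairs scoring below 1 are labeled negative (non-entailment).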
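
The thresholding rule in the description can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: the function name `binarize_score`, the 1/0 label encoding, the example sentence pairs, and the choice to drop pairs scoring between 1 and 4 (inclusive) are all assumptions made here. The description only states which pairs become positive and negative.

```python
from typing import List, Optional, Tuple

POSITIVE_THRESHOLD = 4.0  # scores strictly above this count as entailment
NEGATIVE_THRESHOLD = 1.0  # scores strictly below this count as non-entailment


def binarize_score(score: float) -> Optional[int]:
    """Map a 0-5 similarity score to a label; None means the pair is not used.

    The 1/0 encoding and the None case are assumptions for illustration.
    """
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score out of range: {score}")
    if score > POSITIVE_THRESHOLD:
        return 1  # positive (entailment)
    if score < NEGATIVE_THRESHOLD:
        return 0  # negative (non-entailment)
    return None  # mid-range score: assumed to be excluded


# Hypothetical (sentence1, sentence2, score) triples, invented for this sketch.
rated_pairs = [
    ("A dog runs on the beach.", "A dog is running along the shore.", 4.6),
    ("A man rides a bike.", "A plate of pasta on a table.", 0.3),
    ("Two kids play soccer.", "Children are outside.", 2.8),
]

labeled_pairs: List[Tuple[str, str, int]] = []
for sentence1, sentence2, score in rated_pairs:
    label = binarize_score(score)
    if label is not None:
        labeled_pairs.append((sentence1, sentence2, label))
```

With these made-up inputs, the first pair is kept as positive, the second as negative, and the third is dropped because its score falls between the two thresholds.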

Dataset Splits

  • Splits:
    • Training set: 3,600 instances
    • Validation set: 1,200 instances
    • Test set: 1,560 instances
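
As a quick sanity check on the split sizes listed above, the following sketch sums them and computes each split's share of the total. The dictionary keys `train`, `validation`, and `test` are labels chosen here for illustration and may not match the split names used in the hosted files.

```python
# Split sizes as stated in the dataset description.
split_sizes = {"train": 3600, "validation": 1200, "test": 1560}

total = sum(split_sizes.values())  # 3600 + 1200 + 1560 = 6360

# Share of each split relative to the whole dataset.
fractions = {name: size / total for name, size in split_sizes.items()}

for name, size in split_sizes.items():
    print(f"{name}: {size} instances ({fractions[name]:.1%})")

# The total should fall inside the declared size category 1K<n<10K.
within_size_category = 1_000 < total < 10_000
```

The total of 6,360 instances is consistent with the declared 1K<n<10K size category, and the training set makes up a little over half of the data.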

Citation

  • Citation: If you use this dataset for academic research, please cite the NeurIPS 2023 paper "Back-Modality: Leveraging Modal Transformation for Data Augmentation".