
nyu-mll/multi_nli | Natural language processing dataset | Textual entailment dataset

hugging_face · Updated 2024-01-04 · Indexed 2024-03-04
Natural language processing
Textual entailment
Download link:
https://hf-mirror.com/datasets/nyu-mll/multi_nli
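Summary:
The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433,000 sentence pairs annotated with textual entailment information. It is modeled on the SNLI corpus but covers a range of genres of spoken and written text and supports evaluation of cross-genre generalization. It is used mainly for text classification, specifically natural language inference and multi-input text classification. The dataset is monolingual (English), with a size in the 100K<n<1M range. It was created to evaluate both the quality of a model's sentence representations within the training domain and its ability to derive reasonable representations in unfamiliar domains.
Provided by: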
nyu-mll
Summary of original information

Dataset overview

Name: Multi-Genre Natural Language Inference (MultiNLI)

Language: English

Licenses:

  • cc-by-3.0
  • cc-by-sa-3.0
  • mit
  • other

Multilinguality: monolingual

Size: 100K<n<1M

Source data: original

Task category: text classification

Task IDs:

  • natural-language-inference
  • multi-input-text-classification

Papers with Code ID: multinli

Pretty name: Multi-Genre Natural Language Inference

Dataset structure

Data instances

The dataset contains the following fields:

  • promptID: integer, unique identifier
  • pairID: string, unique identifier
  • premise: string
  • premise_binary_parse: string
  • premise_parse: string
  • hypothesis: string
  • hypothesis_binary_parse: string
  • hypothesis_parse: string
  • genre: string
  • label: class label, one of entailment (0), neutral (1), contradiction (2)
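
As a rough illustration of the field list above, the sketch below models one record as a plain Python dictionary and converts the integer label to its class name. The sample values are invented placeholders, not rows from the corpus; the parse fields are left out for brevity, and `label_name` is a hypothetical helper rather than part of any library.

```python
# Label ids as listed in the field description above.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Hypothetical record: field names follow the list above, values are made up.
example = {
    "promptID": 1,
    "pairID": "1e",
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "A person is performing music.",
    "genre": "fiction",
    "label": 0,
}


def label_name(label_id: int) -> str:
    """Return the class name for an integer label, or raise on unknown ids."""
    try:
        return LABEL_NAMES[label_id]
    except KeyError:
        raise ValueError(f"unknown label id: {label_id}") from None


print(label_name(example["label"]))  # entailment
```

Raising on unknown ids, rather than returning a default, makes unexpected label values visible early instead of silently treating them as one of the three classes.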

Data splits

  • Training set: 392,702 instances
  • Validation matched set: 9,815 instances
  • Validation mismatched set: 9,832 instances
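
The split sizes above can be tallied with a few lines of Python. This is only an arithmetic sketch: the dictionary keys are assumed split names chosen for illustration, and the counts are copied from the list above.

```python
# Instance counts copied from the split list above; the key names are assumptions.
SPLIT_SIZES = {
    "train": 392702,
    "validation_matched": 9815,
    "validation_mismatched": 9832,
}

total = sum(SPLIT_SIZES.values())

# Show each split's size and its share of the listed instances.
for name, size in SPLIT_SIZES.items():
    print(f"{name}: {size} ({size / total:.1%})")

print("total:", total)  # total: 412349
```

The listed splits add up to 412,349 instances, which is below the roughly 433,000 pairs quoted for the whole corpus; the difference is presumably accounted for by test data that is not among the splits listed here.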

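Dataset creation

Source data

  • Data collection: premise sentences were selected from existing text sources, and human annotators were asked to write a new sentence to pair with each premise as its hypothesis.

License details

  • OANC portion: released under the license of the Open American National Corpus (OANC)
  • Fiction portion: several licenses, including the Creative Commons Share-Alike 3.0 Unported License and Creative Commons Attribution 3.0 Unported Licenses

Citation information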

@InProceedings{N18-1101,
  author = "Williams, Adina and Nangia, Nikita and Bowman, Samuel",
  title = "A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference",
  booktitle = "Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  pages = "1112--1122",
  location = "New Orleans, Louisiana",
  url = "http://aclweb.org/anthology/N18-1101"
}

AI-compiled summary
Dataset introduction
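Construction
MultiNLI was built through crowdsourcing: premise sentences were selected from existing text sources, and human annotators wrote hypothesis sentences to pair with them. Each pair is labeled with one of three relations (entailment, neutral, or contradiction), yielding a corpus of 433k sentence pairs. The dataset is designed to evaluate the quality of a model's sentence representations both inside and outside the training domain, with particular emphasis on cross-domain generalization.
Features
The defining feature of MultiNLI is its cross-domain design: it covers multiple genres of spoken and written text and supports model evaluation across domains. It also provides parse information for every sentence, including binary-tree parses and PCFG parses, which gives researchers material for syntactic and semantic analysis. The annotation quality is high, and the data comes with clearly defined training, validation, and test splits, which simplifies training and evaluation.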
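
To make the binary-tree parse fields more concrete, here is a minimal sketch that recovers the token sequence from a bracketed parse string. It assumes that brackets appear as stand-alone, space-separated symbols, as in the invented sample string below; `tokens_from_binary_parse` is a hypothetical helper, and the sample is not taken from the corpus.

```python
from typing import List


def tokens_from_binary_parse(parse: str) -> List[str]:
    """Drop bracket symbols and return the remaining whitespace-separated tokens."""
    return [tok for tok in parse.split() if tok not in ("(", ")")]


# Invented sample in the assumed format: brackets separated from words by spaces.
sample = "( ( The cat ) ( sat . ) )"
print(tokens_from_binary_parse(sample))  # ['The', 'cat', 'sat', '.']
```

If the real parse strings attach brackets directly to words, the split-and-filter step would need to be replaced with a proper tokenizer for the bracket notation.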
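Usage
Researchers can train models on the training set and evaluate them on the validation and test sets. The structure is straightforward: each record contains a premise, a hypothesis, their parses, and the annotated entailment relation. Different splits can be chosen as needed; for example, the matched and mismatched validation sets allow model performance to be compared across domains. The labels can be used directly for supervised learning tasks such as natural language inference and multi-input text classification.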
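
The evaluation workflow described above can be sketched as a per-genre accuracy computation. Everything here is illustrative: `accuracy_by_genre` and `always_entailment` are hypothetical helpers, the rows are invented toy data rather than corpus examples, and the predictor is a trivial baseline standing in for a trained model.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Mapping


def accuracy_by_genre(
    examples: Iterable[Mapping[str, object]],
    predict: Callable[[str, str], int],
) -> Dict[str, float]:
    """Compute accuracy per genre for rows with premise/hypothesis/genre/label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        genre = str(ex["genre"])
        total[genre] += 1
        if predict(str(ex["premise"]), str(ex["hypothesis"])) == ex["label"]:
            correct[genre] += 1
    return {genre: correct[genre] / total[genre] for genre in total}


def always_entailment(premise: str, hypothesis: str) -> int:
    """Trivial baseline: predict label 0 (entailment) for every pair."""
    return 0


# Toy rows, invented for illustration only.
toy_validation = [
    {"premise": "p1", "hypothesis": "h1", "genre": "travel", "label": 0},
    {"premise": "p2", "hypothesis": "h2", "genre": "travel", "label": 2},
    {"premise": "p3", "hypothesis": "h3", "genre": "letters", "label": 0},
]

print(accuracy_by_genre(toy_validation, always_entailment))
# {'travel': 0.5, 'letters': 1.0}
```

Running the same function once on the matched validation rows and once on the mismatched ones would give the in-domain versus out-of-domain comparison the dataset is designed for. Loading those rows, for instance with the Hugging Face `datasets` library, is left out here so that the sketch stays self-contained.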
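Background and challenges
Background overview
MultiNLI was created by Samuel Bowman and colleagues at New York University (NYU) to advance research in natural language inference (NLI). The paper describing it appeared at NAACL-HLT 2018. The corpus contains 433,000 sentence pairs annotated through crowdsourcing and covers multiple genres of spoken and written text. It is modeled on SNLI but places particular emphasis on evaluating cross-domain generalization, giving the natural language processing community a broad-coverage challenge corpus. It also served as the basis for the RepEval shared task at EMNLP 2017.
Current challenges
The challenges associated with MultiNLI fall into two main areas. The first is generalization across genres and domains: a model must perform well not only within the training domain but also, to a reasonable degree, in domains it has not seen. The second concerns dataset construction: crowdsourced annotation has to be organized so that labels are consistent and accurate. In addition, the diversity of the data increases processing complexity and compute requirements, so training and evaluating models efficiently on text at this scale is a further challenge.
Common scenarios
Classic use cases
In natural language processing, MultiNLI is widely used for textual entailment. The corpus collects 433k sentence pairs labeled with one of three relations: entailment, neutral, or contradiction. Researchers use it to train and evaluate models across genres, and in particular to measure how well textual entailment recognition generalizes across domains.
Academic problems addressed
MultiNLI addresses the problem of cross-domain generalization in natural language inference. By providing sentence pairs from multiple genres, it helps researchers develop models that transfer knowledge between domains, which improves robustness and adaptability. This matters for making natural language processing techniques broadly applicable in practice.
Derived related work
Many models, including BERT and RoBERTa, have been fine-tuned and evaluated on MultiNLI and have performed strongly on it and on other benchmarks. MultiNLI has also inspired similar datasets such as XNLI, which extends MultiNLI to multiple languages and has further advanced research on cross-lingual natural language inference.
The content above was collected and summarized by AI.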