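ought/raft | Natural language processing dataset | Machine learning dataset

Source: Hugging Face; updated 2022-10-25; indexed 2024-03-04

Natural language processing

Machine learning

Download link: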

https://hf-mirror.com/datasets/ought/raft


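Summary:

RAFT (Real-world Annotated Few-shot Tasks) is a collection of real-world English-language datasets, each paired with a binary or multi-class classification task. It is intended to improve our understanding of how language models perform on tasks that have practical value. Each dataset provides only 50 labeled examples. The collection targets text classification (multi-class classification), and results can be submitted to the RAFT leaderboard. All text is in American English (en-US). Sub-datasets include Ade Corpus V2, Banking 77, and NeurIPS Impact Statement Risks, among others. RAFT was created to evaluate NLP models on real-world tasks rather than on artificially constructed data sources. Annotation was carried out by experts and crowdworkers, and some sub-datasets contain sensitive information.

Provided by: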

ought

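Summary of original information

Dataset overview

Dataset name

Name: Real-world Annotated Few-shot Tasks (RAFT)
Alias: RAFT

Basic information

Language: English (en-US)
Licenses: various, including the MIT License and CC BY 4.0
Multilinguality: monolingual
Source datasets: original and extended datasets (e.g. ade_corpus_v2, banking77)
Task category: text classification (text-classification)
Task ID: multi-class classification (multi-class-classification)

Dataset structure

Data instances: multiple sub-datasets, such as Ade Corpus V2 and Banking 77; each contains text data and the corresponding labels.
Data fields: an ID field, which indexes each data point, plus one or more text fields.
Data splits: labeled training data and unlabeled test data are provided. Training examples are selected at random, so class balance is not guaranteed.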
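Because the training examples are drawn at random, it can be worth checking how the labels are distributed before fitting anything. The following is a minimal sketch using only the Python standard library. The toy records and the `Label` field name are assumptions made for illustration; they are not taken from the dataset itself.

```python
from collections import Counter


def label_distribution(examples, label_key="Label"):
    """Return {label: (count, fraction)} for a list of example dicts."""
    counts = Counter(ex[label_key] for ex in examples)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}


# Hypothetical stand-in for a 50-example training split (not real RAFT data).
toy_train = [
    {"ID": i, "Label": "complaint" if i % 5 == 0 else "no complaint"}
    for i in range(50)
]

dist = label_distribution(toy_train)
for label, (n, frac) in dist.items():
    print(f"{label}: {n} ({frac:.0%})")
```

A heavily skewed distribution on only 50 examples suggests that plain accuracy may be a misleading measure and that some classes may have very few, or no, training examples.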

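Dataset creation

Curation rationale: to create an evaluation benchmark for NLP models that does not rely on contrived or artificial data sources.
Source data: most sub-datasets were collected from existing sources; some, such as NeurIPS impact statement risks, Semiconductor org types, and TAI Safety Research, were collected directly by the RAFT team.
Annotations: annotators entered labels directly into a Google Spreadsheet. Annotators included contractors paid by Ought and the dataset curators.

Considerations for using the data

Personal and sensitive information: some sub-datasets, such as Tweet Eval Hate, contain highly offensive content; NeurIPS impact statement risks contains author names.
Limitations: some sub-datasets, such as NeurIPS impact statement risks, may contain text that has not been fully verified.

Additional information

Dataset curators: Neel Alex, Eli Lifland, Andreas Stuhlmüller, and others.
Licensing information: each sub-dataset has its own license; for example, Ade Corpus V2 is unlicensed and Banking 77 is under CC BY 4.0.
Contributions: thanks to @neel-alex, @uvafan, @lewtun, and others.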

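AI-collected summary

Dataset introduction

Construction

RAFT aggregates real-world English-language datasets to give NLP models a realistic and practically valuable test benchmark. Each sub-dataset is paired with a binary or multi-class classification task and provides only 50 labeled examples, to simulate a few-shot learning setting. Building the dataset involved collecting and normalizing data from the original sources, followed by annotation by experts or crowdworkers.

Usage

To use RAFT, first load the sub-dataset of interest, then train a model on its labeled training set. Because the test set is unlabeled, evaluation means generating predictions for the test examples and submitting them to the RAFT leaderboard. The ID field indexes each data point; the remaining fields hold the text, such as sentences, titles, and abstracts. When processing the data, take care to protect personal and sensitive information and to comply with applicable laws and regulations.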
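The workflow described above (fit on the labeled examples, predict on the unlabeled test set, keep predictions keyed by ID) can be sketched as follows. This is a deliberately simple illustration, not a recommended model: it uses a nearest-neighbour rule over token overlap, and the toy records, the `Text` and `Label` field names, and the two-column `ID,Label` output layout are all assumptions. Consult the leaderboard instructions for the actual submission format.

```python
import csv
import io
import re
from collections import Counter

# With the Hugging Face `datasets` library installed, a sub-dataset could
# presumably be loaded with datasets.load_dataset("ought/raft", "<subset>").
# That step is omitted so this sketch runs offline on toy data.


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def predict_label(text, train_rows, text_key="Text", label_key="Label"):
    """1-nearest-neighbour by Jaccard overlap; falls back to the majority label."""
    query = tokenize(text)
    majority = Counter(r[label_key] for r in train_rows).most_common(1)[0][0]
    best_label, best_score = majority, 0.0
    for row in train_rows:
        tokens = tokenize(row[text_key])
        union = query | tokens
        score = len(query & tokens) / len(union) if union else 0.0
        if score > best_score:
            best_label, best_score = row[label_key], score
    return best_label


def write_predictions(test_rows, train_rows, out, text_key="Text"):
    """Write one (ID, predicted label) row per unlabeled test example."""
    writer = csv.writer(out)
    writer.writerow(["ID", "Label"])  # assumed column layout
    for row in test_rows:
        writer.writerow([row["ID"], predict_label(row[text_key], train_rows, text_key)])


# Hypothetical records standing in for a labeled train split and an unlabeled test split.
train = [
    {"ID": 0, "Text": "My card payment was declined at the shop", "Label": "card_issue"},
    {"ID": 1, "Text": "How do I change my account password", "Label": "account_help"},
    {"ID": 2, "Text": "The card was swallowed by the cash machine", "Label": "card_issue"},
]
test = [
    {"ID": 100, "Text": "Why was my card declined"},
    {"ID": 101, "Text": "I forgot my password"},
]

buf = io.StringIO()
write_predictions(test, train, buf)
print(buf.getvalue())
```

Keeping the ID next to each prediction matters because the test labels are not available locally, so the ID is the only way to match predictions back to examples when they are scored.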

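Background and challenges

Background

In natural language processing (NLP), few-shot learning is a key research topic: the goal is for models to learn from a small number of examples and generalize to unseen data. To evaluate and advance few-shot learning research, the Ought team created the Real-world Annotated Few-shot Tasks (RAFT) dataset. It consists of multiple English-language datasets, each paired with a binary or multi-class classification task, and is intended to improve our understanding of how language models perform on tasks that have practical value. RAFT was released in 2021 by the Ought team, whose researchers include Neel Alex, Eli Lifland, and Andreas Stuhlmüller. By limiting each task to 50 labeled examples, it gives few-shot learning research in NLP a demanding benchmark.

Current challenges

RAFT poses several challenges for real-world text classification. First, because each dataset provides only 50 labeled examples, a model must learn from very limited information, which demands strong generalization. Second, building the dataset required researchers to collect and integrate data from multiple sources, and ensuring data quality and consistency across them is complex. Third, because the datasets contain real-world text, they may carry social biases and sensitive information, which must be handled carefully to avoid unfair outcomes. Finally, because the collection spans many heterogeneous tasks, running training and evaluation efficiently across all of them is also a challenge.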

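Common use cases

Typical use case

Few-shot learning has long been an active research area in NLP, particularly for text classification. As a collection of real-world English-language datasets, RAFT is typically used as a benchmark for few-shot text classification. It covers datasets from several domains, such as finance, technology, and social media. Each dataset provides a small number of labeled examples (50 per dataset) and is framed as a binary or multi-class classification task, which gives model evaluation a real-world application context.

Academic problems addressed

RAFT addresses the problem of benchmarking NLP models in the few-shot setting. Earlier few-shot learning research often relied on artificially constructed datasets, which may not reflect real-world data distributions or task complexity. By aggregating real-world data and framing it as few-shot classification tasks with concrete practical value, RAFT lets researchers assess few-shot performance more accurately and supports progress toward real-world use of few-shot learning.

Practical applications

RAFT can be used to evaluate and improve few-shot learning algorithms on real-world text classification tasks. In finance, it can be used to evaluate how well a model classifies banking service queries. In technology, it can be used to evaluate how well a model classifies risks mentioned in the impact statements of academic papers. In social media, it can be used to evaluate hate speech detection. RAFT can also be used to develop new few-shot learning algorithms that generalize better in real-world applications.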

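Recent research on the dataset

Latest research directions

Few-shot learning has become an important research direction in NLP. RAFT was introduced to give NLP models a dataset of real-world few-shot tasks, so that their performance in practical applications can be assessed more accurately. RAFT covers multiple sub-tasks, each providing only 50 labeled examples, so researchers must consider how to make effective use of limited labeled data when training models. RAFT also provides a leaderboard, which encourages exchange and competition among researchers and drives progress in few-shot learning.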
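One common way to use a handful of labeled examples without updating model weights is to place them directly in a prompt for a language model. The sketch below only assembles such a prompt as a string. The `Text:` and `Label:` layout, the field names, and the example records are assumptions for illustration, and no particular model or API is implied.

```python
def build_prompt(instruction, examples, query,
                 text_key="Text", label_key="Label", max_examples=5):
    """Assemble a plain-text few-shot prompt from labeled examples."""
    lines = [instruction.strip(), ""]
    # Include at most max_examples demonstrations to keep the prompt short.
    for ex in examples[:max_examples]:
        lines += [f"Text: {ex[text_key]}", f"Label: {ex[label_key]}", ""]
    # The final block leaves the label blank for the model to complete.
    lines += [f"Text: {query}", "Label:"]
    return "\n".join(lines)


# Hypothetical labeled examples (not real RAFT data).
labeled = [
    {"Text": "The flight was delayed for six hours", "Label": "complaint"},
    {"Text": "Thanks for the quick reply", "Label": "no complaint"},
    {"Text": "My order never arrived", "Label": "complaint"},
]

prompt = build_prompt(
    "Decide whether each text is a complaint.",
    labeled,
    "The app keeps crashing on startup",
    max_examples=2,
)
print(prompt)
```

With only 50 labeled examples per task, the number of demonstrations that fit in a prompt is limited mainly by the model's context length, which is why the sketch exposes `max_examples` as a parameter.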

The content above was collected and summarized by AI.

