CREAK | Commonsense Reasoning Dataset | Entity Knowledge Dataset

Source: GitHub. Updated 2021-09-01; indexed 2025-02-08.

Commonsense reasoning

Entity knowledge

Download link:

https://github.com/yasumasaonoe/creak


Resource overview:

The CREAK dataset was introduced to explore how well models combine entity knowledge with commonsense reasoning. It connects factual details about entities (e.g., wizards like Harry Potter are skilled at broomstick flying) with commonsense reasoning principles (e.g., someone proficient in a skill can teach it to others). Combining the two yields reasoning queries (e.g., whether Harry Potter can teach broomstick flying).

Provided by:

The University of Texas at Austin

Created:

2021-09-01

Summary of Original Information

CREAK Dataset Overview

Basic Information

Dataset name: CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge
Paper information:
- Title: CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge
- Authors: Yasumasa Onoe, Michael J.Q. Zhang, Eunsol Choi, Greg Durrett
- Venue: NeurIPS 2021 Datasets and Benchmarks Track
- Paper link: https://openreview.net/pdf?id=mbW_GT3ZN-
- Year: 2021

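Dataset Contents

Data files:
- train.json: 10,176 training examples
- dev.json: 1,371 development examples
- test_without_labels.json: 1,371 test examples (unlabeled)
- contrast_set.json: 500 contrast examples
Data format: jsonlines (one JSON object per line)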

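Example fields:

| Field | Description |
| --- | --- |
| `ex_id` | Example ID |
| `sentence` | The claim |
| `explanation` | Annotator-provided explanation of why the claim is TRUE/FALSE |
| `label` | Label: true or false |
| `entity` | Seed entity |
| `en_wiki_pageid` | English Wikipedia page ID of the seed entity |
| `entity_mention_loc` | Location of the seed entity within the claim |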
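
The files above are described as jsonlines, so each line should hold one example with the fields listed in the table. The following is a minimal sketch of reading such a file in Python. The helper names `load_jsonl` and `label_counts` are hypothetical, the two sample records are invented for illustration and are not taken from the dataset, and the code assumes `label` may be stored either as a string or as a JSON boolean.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def label_counts(records):
    """Count labels, normalizing booleans and strings to 'true'/'false'."""
    return Counter(str(r.get("label")).lower() for r in records)


# Invented records that only mimic the documented field names.
sample = [
    {
        "ex_id": "demo_0",
        "sentence": "Harry Potter can teach classes on how to fly on a broomstick.",
        "explanation": "Harry Potter is a wizard who is skilled at broomstick flying.",
        "label": "true",
        "entity": "Harry Potter",
    },
    {
        "ex_id": "demo_1",
        "sentence": "Harry Potter has never flown on a broomstick.",
        "explanation": "Harry Potter is known for being skilled at broomstick flying.",
        "label": "false",
        "entity": "Harry Potter",
    },
]

# Write the sample to a temporary file so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.json"
    demo_path.write_text(
        "\n".join(json.dumps(r) for r in sample) + "\n", encoding="utf-8"
    )
    records = load_jsonl(demo_path)

print(len(records), dict(label_counts(records)))
```

With the real data, `load_jsonl("train.json")` would take the place of the temporary file; the record count can then be checked against the sizes listed above.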

Updates

Update date: November 8, 2021
Update: the contrast set was expanded to 500 examples

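Baselines and Leaderboard

Baselines: see baselines/README.md
Leaderboard: https://www.cs.utexas.edu/~yasumasa/creak/leaderboard.html
- Submission requirements: only results from closed-book methods fine-tuned on in-domain data are accepted
- How to submit: email the system name and the prediction files for the dev, test, and contrast sets to yasumasa@utexas.edu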

Contact

Contact: yasumasa@utexas.edu

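AI-Compiled Summary

Dataset Introduction

Construction

CREAK is designed to support research on commonsense reasoning over entity knowledge, and it was built by combining human annotation with automated tooling. Entities are first selected from Wikipedia as seeds; annotators then write claims about those entities and label each one as true or false. Every claim comes with an explanation of why it is true or false. The dataset also includes a contrast set for evaluating models on adversarial examples. The construction process follows quality-control procedures intended to keep the data accurate and diverse.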

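Characteristics

CREAK centers on entity knowledge and commonsense reasoning. It contains 10,176 training examples and 1,371 development examples, each consisting of a claim, a true/false label, and an explanation. The dataset emphasizes diversity of entity knowledge, covering a wide range of entity categories and commonsense scenarios. It also provides a 500-example contrast set for testing model robustness on adversarial examples. This design makes CREAK well suited to evaluating and improving commonsense reasoning models.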

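Usage

CREAK can be used for a range of natural language processing tasks. Users load the jsonlines data files to obtain the training, development, and test examples. Each example includes the claim, label, explanation, and entity information, which supports both model training and evaluation. The dataset is particularly suited to closed-book commonsense reasoning: a model is fine-tuned on the training set and its performance is checked on the development and test sets. The contrast set can be used to probe robustness further. Users can also submit predictions to the official leaderboard to compare in-domain performance.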
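
Because each example carries a binary true/false label, development-set performance can be summarized as plain accuracy. The sketch below is illustrative only: `normalize_label`, `accuracy`, and `majority_label` are hypothetical helper names, the label lists are invented, and accuracy is assumed here as the metric rather than taken from the official evaluation code. A majority-class baseline is included as a sanity check that a fine-tuned model should beat.

```python
from collections import Counter


def normalize_label(value):
    """Map a label stored as a bool or a string to 'true' or 'false'."""
    return str(value).strip().lower()


def accuracy(gold, predicted):
    """Fraction of positions where the predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(
        normalize_label(g) == normalize_label(p) for g, p in zip(gold, predicted)
    )
    return correct / len(gold)


def majority_label(train_labels):
    """Return the most frequent label in the training data."""
    counts = Counter(normalize_label(label) for label in train_labels)
    return counts.most_common(1)[0][0]


# Invented labels standing in for the `label` field of train.json and dev.json.
train_labels = ["true", "true", "false"]
dev_labels = ["true", "false", "false", "true"]

baseline = majority_label(train_labels)
baseline_predictions = [baseline] * len(dev_labels)
print(baseline, accuracy(dev_labels, baseline_predictions))
```

The same `accuracy` function can be applied to model predictions on the contrast set, as long as the predictions are listed in the same order as the gold examples.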

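Background and Challenges

Background

CREAK was proposed in 2021 by Yasumasa Onoe, Michael J.Q. Zhang, Eunsol Choi, and Greg Durrett and published in the NeurIPS 2021 Datasets and Benchmarks Track. The dataset targets commonsense reasoning, specifically reasoning over entity knowledge. By providing true and false claims together with explanations and entity information, CREAK aims to advance the study of commonsense reasoning in natural language processing. Its seed entities are drawn from Wikipedia and span a wide range of entities and situations, giving researchers a resource for examining how models handle complex commonsense reasoning tasks.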

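Current Challenges

CREAK poses challenges in two main areas. First, commonsense reasoning is itself complex and ambiguous: a model must understand implicit background knowledge and contextual relations to judge a claim correctly. Second, annotation consistency and accuracy are a central concern in building the dataset. Because commonsense reasoning involves subjective judgment, different annotators may read the same claim differently, so a strict annotation procedure and quality-control mechanism are required. In addition, the diversity of entities makes generalization harder and demands strong cross-domain reasoning from models.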

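Common Use Cases

Typical Usage

CREAK is designed for commonsense reasoning tasks and is used in natural language processing, particularly for reasoning about and verifying entity knowledge. Researchers train and evaluate models on it to improve their handling of complex entity relations and commonsense inference. Its entity information and explanations give models material to learn and reason from.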

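Academic Problems Addressed

CREAK addresses the challenge of commonsense reasoning in natural language processing, particularly the verification of and reasoning over entity knowledge. Its annotated data helps researchers develop models that judge whether a claim is true, advancing research on commonsense reasoning. Its contrast set further supports assessing model robustness and generalization.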

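Derived and Related Work

Researchers have built a variety of natural language processing models on CREAK, such as Transformer-based pretrained models and contrastive learning frameworks. These models perform well on commonsense reasoning tasks and have pushed the field forward. CREAK has also prompted further research on entity knowledge and commonsense reasoning, providing data and benchmarks for later work.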

The content above was collected and summarized by AI.
