atomic
Source: ModelScope community (updated 2025-11-27, indexed 2025-05-31)
Download link: https://modelscope.cn/datasets/allenai/atomic
Description:
# Dataset Card for An Atlas of Machine Commonsense for If-Then Reasoning - Atomic Common Sense Dataset
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:**
https://homes.cs.washington.edu/~msap/atomic/
- **Repository:**
https://homes.cs.washington.edu/~msap/atomic/
- **Paper:**
Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith & Yejin Choi (2019). ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning. AAAI.
### Dataset Summary
This dataset provides the template sentences and
relationships defined in the ATOMIC common sense dataset. There are
three splits - train, test, and dev.
Data can be downloaded here: [https://maartensap.com/atomic/data/atomic_data.tgz](https://maartensap.com/atomic/data/atomic_data.tgz)
Files present:
- `v4_atomic_all_agg.csv`: contains one event per line, with all annotations aggregated into one list (but not de-duplicated, so there might be repeats).
- `v4_atomic_all.csv`: keeps track of which worker did which annotations. Each line is the answers from one worker only, so there are multiple lines for the same event.
- `v4_atomic_trn.csv`, `v4_atomic_dev.csv`, `v4_atomic_tst.csv`: same as above, but split based on train/dev/test split.
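
The relationship between the per-worker file and the aggregated file can be sketched roughly as follows. This is only an illustration, not the authors' aggregation script: the `aggregate_by_event` and `dedupe` helpers and the sample rows are hypothetical, and only two of the nine dimensions are shown.

```python
from collections import defaultdict

# Hypothetical per-worker rows: one dict per (event, worker) pair,
# mirroring the layout described for v4_atomic_all.csv.
worker_rows = [
    {"event": "PersonX pays the bill", "xReact": ["relieved"], "xWant": ["to go home"]},
    {"event": "PersonX pays the bill", "xReact": ["relieved", "proud"], "xWant": []},
]

DIMENSIONS = ["xReact", "xWant"]  # subset of the nine dimensions, for brevity


def aggregate_by_event(rows, dimensions):
    """Concatenate every worker's annotations per event, keeping repeats."""
    merged = defaultdict(lambda: {dim: [] for dim in dimensions})
    for row in rows:
        for dim in dimensions:
            merged[row["event"]][dim].extend(row[dim])
    return dict(merged)


def dedupe(values):
    """Drop repeated annotations while preserving first-seen order."""
    return list(dict.fromkeys(values))


aggregated = aggregate_by_event(worker_rows, DIMENSIONS)
print(aggregated["PersonX pays the bill"]["xReact"])          # repeats kept
print(dedupe(aggregated["PersonX pays the bill"]["xReact"]))  # repeats removed
```

Because the aggregated lists are not de-duplicated, a step like `dedupe` is needed whenever repeated annotations should count only once.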
All files are CSVs containing the following columns:
- event: just a string representation of the event.
- oEffect,oReact,oWant,xAttr,xEffect,xIntent,xNeed,xReact,xWant: annotations for each of the dimensions, stored in a json-dumped list of strings.
  **Note**: `["none"]` (written as `[""none""]` in the raw CSV text, because CSV doubles quotes inside a quoted field) means the worker explicitly responded with the empty response, whereas `[]` means the worker did not annotate this dimension.
- prefix: json-dumped list that represents the prefix of content words (used to make a better trn/dev/tst split).
- split: string rep of which split the event belongs to.
Suggested code for loading the data into a pandas dataframe:
```python
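import json

import pandas as pd

# Use the first column (the event) as the index, so the nine dimension columns come first.
df = pd.read_csv("v4_atomic_all.csv", index_col=0)
# Decode each JSON-dumped dimension column into a Python list of strings.
df.iloc[:, :9] = df.iloc[:, :9].apply(lambda col: col.apply(json.loads))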
```
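
For readers who prefer not to depend on pandas, the same decoding step can be sketched with the standard library alone. The sample row and the `parse_rows` helper below are hypothetical, and the column order is assumed from the column listing above. The sketch also shows why a list such as `["none"]` appears with doubled quotes in the raw CSV text.

```python
import csv
import io
import json

DIMENSIONS = ["oEffect", "oReact", "oWant", "xAttr", "xEffect",
              "xIntent", "xNeed", "xReact", "xWant"]
LIST_COLUMNS = DIMENSIONS + ["prefix"]  # columns holding JSON-dumped lists


def parse_rows(csv_text):
    """Read CSV text and decode every JSON-dumped list column."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = dict(raw)
        for column in LIST_COLUMNS:
            row[column] = json.loads(raw[column])
        rows.append(row)
    return rows


# Build a one-row, in-memory sample (hypothetical data, assumed column order).
sample = {dim: json.dumps([]) for dim in DIMENSIONS}
sample.update(event="PersonX pays the bill", xReact=json.dumps(["none"]),
              prefix=json.dumps(["pays", "bill"]), split="trn")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["event"] + LIST_COLUMNS + ["split"])
writer.writeheader()
writer.writerow(sample)
csv_text = buffer.getvalue()

print(csv_text.splitlines()[1])  # raw line: quotes inside fields are doubled
rows = parse_rows(csv_text)
print(rows[0]["xReact"], rows[0]["oEffect"])  # decoded Python lists
```

To use this on the real data, the in-memory sample would be replaced by the text of one of the `v4_atomic_*.csv` files.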
**_Disclaimer/Content warning_**: the events in atomic have been automatically extracted from blogs, stories and books written at various times.
The events might depict violent or problematic actions, which we left in the corpus for the sake of learning the (probably negative but still important) commonsense implications associated with the events.
We removed a small set of truly out-dated events, but might have missed some so please email us (msap@cs.washington.edu) if you have any concerns.
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
en
## Dataset Structure
### Data Instances
Here is one example from the atomic dataset:
```
{'event': "PersonX uses PersonX's ___ to obtain", 'oEffect': [], 'oReact': ['annoyed', 'angry', 'worried'], 'oWant': [], 'prefix': ['uses', 'obtain'], 'split': 'trn', 'xAttr': [], 'xEffect': [], 'xIntent': ['to have an advantage', 'to fulfill a desire', 'to get out of trouble'], 'xNeed': [], 'xReact': ['pleased', 'smug', 'excited'], 'xWant': []}
```
### Data Fields
Notes from the authors:
* event: just a string representation of the event.
* oEffect, oReact, oWant, xAttr, xEffect, xIntent, xNeed, xReact, xWant: annotations for each of the dimensions, each stored as a json-dumped list of strings. Dimensions prefixed with `o` describe others; those prefixed with `x` describe PersonX.
  Note: `["none"]` means the worker explicitly responded with the empty response, whereas `[]` means the worker did not annotate this dimension.
* prefix: json-dumped string that represents the prefix of content words (used to make a better trn/dev/tst split).
* split: string rep of which split the event belongs to.
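
The distinction between `["none"]` and `[]` is easy to lose once the lists are decoded, so it may help to make it explicit. The following is a minimal sketch, assuming the lists have already been decoded from JSON; the `annotation_status` helper and its three status labels are hypothetical names, not part of the dataset.

```python
def annotation_status(values):
    """Classify one decoded dimension list.

    - []                 -> the worker did not annotate this dimension
    - only "none" items  -> the worker explicitly gave the empty response
    - anything else      -> at least one real annotation is present
    """
    if not values:
        return "not annotated"
    if all(value == "none" for value in values):
        return "explicit none"
    return "annotated"


# Part of the example instance from this card, plus one hypothetical
# ["none"] entry for xNeed to show all three outcomes.
record = {
    "oEffect": [],
    "oReact": ["annoyed", "angry", "worried"],
    "xNeed": ["none"],  # hypothetical value, for illustration only
}

for dimension, values in record.items():
    print(dimension, "->", annotation_status(values))
```

Treating a list that mixes `"none"` with real annotations as "annotated" is a design choice made here for aggregated rows; other policies are equally defensible.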
### Data Splits
The ATOMIC dataset has three splits: train, dev, and test, distributed as `v4_atomic_trn.csv`, `v4_atomic_dev.csv`, and `v4_atomic_tst.csv`.
## Dataset Creation
### Curation Rationale
This dataset was gathered and created to assist in commonsense reasoning.
### Source Data
#### Initial Data Collection and Normalization
See the research paper and website for more detail. The dataset was
created by the University of Washington using crowdsourced data.
#### Who are the source language producers?
The ATOMIC authors and crowdsourced workers.
### Annotations
#### Annotation process
Human annotators provided the annotations by filling in forms.
#### Who are the annotators?
Human annotators (crowd workers).
### Personal and Sensitive Information
Unknown, but likely none.
## Considerations for Using the Data
### Social Impact of Dataset
The goal for the work is to help machines understand common sense.
### Discussion of Biases
Since the data comes from human annotators, it is likely to be biased. From the authors:
Disclaimer/Content warning: the events in atomic have been automatically extracted from blogs, stories and books written at various times. The events might depict violent or problematic actions, which we left in the corpus for the sake of learning the (probably negative but still important) commonsense implications associated with the events. We removed a small set of truly out-dated events, but might have missed some so please email us (msap@cs.washington.edu) if you have any concerns.
### Other Known Limitations
While there are many relationships, the data is quite sparse. Also, each item of the dataset could be expanded into multiple sentences along the various dimensions, oEffect, oReact, etc.
For example, given event: "PersonX uses PersonX's ___ to obtain" and dimension oReact: "annoyed", this could be transformed into an entry:
"PersonX uses PersonX's ___ to obtain => PersonY is annoyed"
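
The expansion described above can be sketched as follows. This is an illustration only: the `TEMPLATES` wording and the `expand` helper are hypothetical, only three dimensions are covered, and the `oReact` template is chosen to reproduce the example sentence given here.

```python
# Hypothetical sentence templates, one per dimension; only oReact's wording
# comes from the example above, the others are assumptions.
TEMPLATES = {
    "oReact": "PersonY is {}",
    "xReact": "PersonX is {}",
    "xIntent": "PersonX wanted {}",
}


def expand(record, templates=TEMPLATES):
    """Turn one event record into 'event => inference' sentences."""
    sentences = []
    for dimension, template in templates.items():
        for value in record.get(dimension, []):
            if value == "none":  # explicit empty response: nothing to expand
                continue
            sentences.append(f"{record['event']} => {template.format(value)}")
    return sentences


# The example instance from this card, reduced to the covered dimensions.
record = {
    "event": "PersonX uses PersonX's ___ to obtain",
    "oReact": ["annoyed", "angry", "worried"],
    "xIntent": ["to have an advantage", "to fulfill a desire", "to get out of trouble"],
    "xReact": ["pleased", "smug", "excited"],
}

for sentence in expand(record):
    print(sentence)
```

With these templates, the single example record yields nine sentences, three for each covered dimension.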
## Additional Information
### Dataset Curators
The authors of ATOMIC at the University of Washington.
### Licensing Information
The Creative Commons Attribution 4.0 International License. https://creativecommons.org/licenses/by/4.0/
### Citation Information
@article{Sap2019ATOMICAA,
title={ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning},
author={Maarten Sap and Ronan Le Bras and Emily Allaway and Chandra Bhagavatula and Nicholas Lourie and Hannah Rashkin and Brendan Roof and Noah A. Smith and Yejin Choi},
journal={ArXiv},
year={2019},
volume={abs/1811.00146}
}
### Contributions
Thanks to [@ontocord](https://github.com/ontocord) for adding this dataset.
Provider: maas
Created: 2025-05-28



