tuanphong/ascent_kb
Source: Hugging Face. Updated: 2024-01-09. Indexed: 2024-05-25.
Download link: https://hf-mirror.com/datasets/tuanphong/ascent_kb
Resource description:
---
annotations_creators:
- found
language_creators:
- found
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- other
task_ids: []
paperswithcode_id: ascentkb
pretty_name: Ascent KB
tags:
- knowledge-base
dataset_info:
- config_name: canonical
  features:
  - name: arg1
    dtype: string
  - name: rel
    dtype: string
  - name: arg2
    dtype: string
  - name: support
    dtype: int64
  - name: facets
    list:
    - name: value
      dtype: string
    - name: type
      dtype: string
    - name: support
      dtype: int64
  - name: source_sentences
    list:
    - name: text
      dtype: string
    - name: source
      dtype: string
  splits:
  - name: train
    num_bytes: 2976665740
    num_examples: 8904060
  download_size: 898478552
  dataset_size: 2976665740
- config_name: open
  features:
  - name: subject
    dtype: string
  - name: predicate
    dtype: string
  - name: object
    dtype: string
  - name: support
    dtype: int64
  - name: facets
    list:
    - name: value
      dtype: string
    - name: type
      dtype: string
    - name: support
      dtype: int64
  - name: source_sentences
    list:
    - name: text
      dtype: string
    - name: source
      dtype: string
  splits:
  - name: train
    num_bytes: 2882646222
    num_examples: 8904060
  download_size: 900156754
  dataset_size: 2882646222
configs:
- config_name: canonical
  data_files:
  - split: train
    path: canonical/train-*
  default: true
- config_name: open
  data_files:
  - split: train
    path: open/train-*
---
# Dataset Card for Ascent KB
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://ascent.mpi-inf.mpg.de/
- **Repository:** https://github.com/phongnt570/ascent
- **Paper:** https://arxiv.org/abs/2011.00905
- **Point of Contact:** http://tuan-phong.com
### Dataset Summary
This dataset contains 8.9M commonsense assertions extracted by the Ascent pipeline developed at the [Max Planck Institute for Informatics](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/).
The focus of this dataset is on everyday concepts such as *elephant*, *car*, and *laptop*.
The current version of Ascent KB (v1.0.0) is approximately **19 times larger than ConceptNet** (note that, in this comparison, non-commonsense knowledge in ConceptNet such as lexical relations is excluded).
For more details, take a look at
[the research paper](https://arxiv.org/abs/2011.00905) and
[the website](https://ascent.mpi-inf.mpg.de).
### Supported Tasks and Leaderboards
The dataset can be used in a wide range of downstream tasks such as commonsense question answering or dialogue systems.
### Languages
The dataset is in English.
## Dataset Structure
### Data Instances
There are two configurations available for this dataset:
1. `canonical` (default): This part contains `<arg1 ; rel ; arg2>`
   assertions where the relations (`rel`) were mapped to
   [ConceptNet relations](https://github.com/commonsense/conceptnet5/wiki/Relations)
   with slight modifications:
   - Two new relations were introduced: `/r/HasSubgroup` and `/r/HasAspect`.
   - All `/r/HasA` relations were replaced with `/r/HasAspect`.
     This is motivated by the [ATOMIC-2020](https://allenai.org/data/atomic-2020)
     schema, although ATOMIC-2020 groups all `/r/HasA` and
     `/r/HasProperty` into `/r/HasProperty`.
   - The `/r/UsedFor` relation was replaced with `/r/ObjectUse`,
     which is broader (it can mean _"used for"_, _"used in"_, _"used as"_, etc.).
     This is also taken from ATOMIC-2020.
2. `open`: This part contains open assertions of the form
   `<subject ; predicate ; object>` extracted directly from web
   content. This is the original form of the `canonical` triples.
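The two relation substitutions described for the `canonical` configuration can be summarized as a small lookup table. The following is an illustrative sketch only, not code from the Ascent pipeline; the names `RELATION_REWRITES` and `to_ascent_relation` are hypothetical.

```python
# Hypothetical helper: maps a ConceptNet relation label to the label used in
# the `canonical` configuration, following the two substitutions listed above.
# `/r/HasSubgroup` is new in Ascent KB and has no ConceptNet counterpart here.
RELATION_REWRITES = {
    "/r/HasA": "/r/HasAspect",
    "/r/UsedFor": "/r/ObjectUse",
}


def to_ascent_relation(conceptnet_rel: str) -> str:
    """Return the Ascent KB label for a ConceptNet relation.

    Relations that are not rewritten are returned unchanged.
    """
    return RELATION_REWRITES.get(conceptnet_rel, conceptnet_rel)


print(to_ascent_relation("/r/HasA"))         # /r/HasAspect
print(to_ascent_relation("/r/UsedFor"))      # /r/ObjectUse
print(to_ascent_relation("/r/HasProperty"))  # /r/HasProperty
```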
In both configurations, each assertion is equipped with
extra information including: a set of semantic `facets`
(e.g., *LOCATION*, *TEMPORAL*, etc.), its `support` (i.e., number of occurrences),
and a list of `source_sentences`.
An example row in the `canonical` configuration:
```JSON
{
  "arg1": "elephant",
  "rel": "/r/HasProperty",
  "arg2": "intelligent",
  "support": 15,
  "facets": [
    {
      "value": "extremely",
      "type": "DEGREE",
      "support": 11
    }
  ],
  "source_sentences": [
    {
      "text": "Elephants are extremely intelligent animals.",
      "source": "https://www.softschools.com/facts/animals/asian_elephant_facts/2310/"
    },
    {
      "text": "Elephants are extremely intelligent creatures and an elephant's brain can weigh as much as 4-6 kg.",
      "source": "https://www.elephantsforafrica.org/elephant-facts/"
    }
  ]
}
```
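To show how the fields of such a row fit together, here is a minimal Python sketch that turns the example above into a readable line. The function name `describe_assertion` and the output format are assumptions made for illustration; the row is copied from the example, with the source sentences omitted for brevity.

```python
# Example row from the `canonical` configuration (source sentences omitted).
row = {
    "arg1": "elephant",
    "rel": "/r/HasProperty",
    "arg2": "intelligent",
    "support": 15,
    "facets": [{"value": "extremely", "type": "DEGREE", "support": 11}],
    "source_sentences": [],
}


def describe_assertion(row: dict) -> str:
    """Format a `canonical` row as '<arg1 ; rel ; arg2>' plus support and facets."""
    triple = f"<{row['arg1']} ; {row['rel']} ; {row['arg2']}>"
    # Each facet is rendered as TYPE=value followed by its own support count.
    facets = ", ".join(
        f"{f['type']}={f['value']} ({f['support']})" for f in row["facets"]
    )
    text = f"{triple} [support={row['support']}]"
    return f"{text} facets: {facets}" if facets else text


print(describe_assertion(row))
# <elephant ; /r/HasProperty ; intelligent> [support=15] facets: DEGREE=extremely (11)
```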
### Data Fields
- **For `canonical` configuration**
  - `arg1`: the first argument to the relationship, e.g., *elephant*
  - `rel`: the canonical relation, e.g., */r/HasProperty*
  - `arg2`: the second argument to the relationship, e.g., *intelligent*
  - `support`: the number of occurrences of the assertion, e.g., *15*
  - `facets`: an array of semantic facets, each containing
    - `value`: facet value, e.g., *extremely*
    - `type`: facet type, e.g., *DEGREE*
    - `support`: the number of occurrences of the facet, e.g., *11*
  - `source_sentences`: an array of source sentences from which the assertion was
    extracted, each containing
    - `text`: the raw text of the sentence
    - `source`: the URL to its parent document
- **For `open` configuration**
  - The fields of this configuration are the same as those of the `canonical`
    configuration, except that the (`arg1`, `rel`, `arg2`) fields are replaced
    with the (`subject`, `predicate`, `object`) fields, which are free-text
    phrases extracted directly from the source sentences
    using an Open Information Extraction (OpenIE) tool.
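Because the two configurations differ only in the names of the triple fields, a row from either one can be read into a common shape. The sketch below is one possible way to do that; the `Assertion` dataclass and the `from_row` function are hypothetical names, not part of the dataset or of any library, and the `open` row is a made-up illustration rather than an actual record.

```python
from dataclasses import dataclass, field


@dataclass
class Assertion:
    head: str      # `arg1` (canonical) or `subject` (open)
    relation: str  # `rel` (canonical) or `predicate` (open)
    tail: str      # `arg2` (canonical) or `object` (open)
    support: int
    facets: list = field(default_factory=list)
    source_sentences: list = field(default_factory=list)


def from_row(row: dict) -> Assertion:
    """Build an Assertion from a row of either configuration."""
    if "arg1" in row:
        keys = ("arg1", "rel", "arg2")
    else:
        keys = ("subject", "predicate", "object")
    head, relation, tail = (row[k] for k in keys)
    return Assertion(
        head,
        relation,
        tail,
        row["support"],
        row.get("facets", []),
        row.get("source_sentences", []),
    )


# Row taken from the `canonical` example in this card (extra fields omitted).
canonical_row = {"arg1": "elephant", "rel": "/r/HasProperty",
                 "arg2": "intelligent", "support": 15}
# Hypothetical `open` row, invented purely to exercise the other branch.
open_row = {"subject": "elephant", "predicate": "be",
            "object": "intelligent", "support": 1}

print(from_row(canonical_row).relation)  # /r/HasProperty
print(from_row(open_row).relation)       # be
```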
### Data Splits
The dataset has no predefined splits. All data points belong to a single default split named `train`.
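Since everything lives in the single `train` split, loading either configuration only requires choosing the configuration name. The following is a sketch that assumes the Hugging Face `datasets` library is installed and that the repository ID `tuanphong/ascent_kb` shown at the top of this page is still valid; `load_ascent_kb` is a hypothetical wrapper name. Streaming is the default here because each configuration is roughly a 0.9 GB download.

```python
VALID_CONFIGS = ("canonical", "open")


def load_ascent_kb(config: str = "canonical", streaming: bool = True):
    """Load the `train` split of one Ascent KB configuration.

    Hypothetical convenience wrapper around `datasets.load_dataset`.
    """
    if config not in VALID_CONFIGS:
        raise ValueError(
            f"unknown configuration {config!r}; expected one of {VALID_CONFIGS}"
        )
    # Imported lazily so that defining this function does not require
    # `datasets` to be installed.
    from datasets import load_dataset

    return load_dataset(
        "tuanphong/ascent_kb", config, split="train", streaming=streaming
    )
```

With network access, `next(iter(load_ascent_kb("open")))` should then yield the first row as a dictionary.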
## Dataset Creation
### Curation Rationale
The commonsense knowledge base was created to assist in the development of robust and reliable AI.
### Source Data
#### Initial Data Collection and Normalization
Texts were collected from the web using the Bing Search API, and went through various cleaning steps before being processed by an OpenIE tool to get open assertions.
The assertions were then grouped into semantically equivalent clusters.
Take a look at the research paper for more details.
#### Who are the source language producers?
Web users.
### Annotations
#### Annotation process
None.
#### Who are the annotators?
None.
### Personal and Sensitive Information
Unknown.
## Considerations for Using the Data
### Social Impact of Dataset
[Needs More Information]
### Discussion of Biases
[Needs More Information]
### Other Known Limitations
[Needs More Information]
## Additional Information
### Dataset Curators
The knowledge base has been developed by researchers at the
[Max Planck Institute for Informatics](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/).
Contact [Tuan-Phong Nguyen](http://tuan-phong.com) in case of questions and comments.
### Licensing Information
[The Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)
### Citation Information
```
@InProceedings{nguyen2021www,
  title={Advanced Semantics for Commonsense Knowledge Extraction},
  author={Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},
  year={2021},
  booktitle={The Web Conference 2021},
}
```
### Contributions
Thanks to [@phongnt570](https://github.com/phongnt570) for adding this dataset.
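Provider:
tuanphong
Summary of original information
Dataset overview
Dataset name: Ascent KB
Dataset description: Ascent KB is a dataset of 8.9 million commonsense assertions extracted by the Ascent pipeline developed at the Max Planck Institute for Informatics. It focuses on everyday concepts such as elephant, car, and laptop.
Dataset size: between 1M and 10M examples.
Language: English
License: CC-BY-4.0
Multilinguality: monolingual
Task category: other
Dataset structure
Data instances
- Configuration 1 (default): `canonical`, containing `<arg1 ; rel ; arg2>` assertions in which the relation `rel` is mapped to ConceptNet relations with some modifications.
- Configuration 2: `open`, containing open assertions of the form `<subject ; predicate ; object>` extracted directly from web content.
Data fields
- `canonical` configuration:
  - `arg1`: the first argument of the relation
  - `rel`: the canonical relation
  - `arg2`: the second argument of the relation
  - `support`: the number of occurrences of the assertion
  - `facets`: semantic facets, each with `value`, `type`, and `support`
  - `source_sentences`: the list of source sentences from which the assertion was extracted, each with `text` and `source`
- `open` configuration: the same as the `canonical` configuration, except that the `arg1`, `rel`, and `arg2` fields are replaced with `subject`, `predicate`, and `object`.
Data splits
- All data points belong to the default split `train`; there are no other splits.
Dataset creation
Source data
- Texts were collected from the web via the Bing Search API and, after cleaning steps, processed by an OpenIE tool.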