
tuanphong/ascent_kb

Hugging Face · Updated 2024-01-09 · Indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/tuanphong/ascent_kb
Resource description:
---
annotations_creators:
- found
language_creators:
- found
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- other
task_ids: []
paperswithcode_id: ascentkb
pretty_name: Ascent KB
tags:
- knowledge-base
dataset_info:
- config_name: canonical
  features:
  - name: arg1
    dtype: string
  - name: rel
    dtype: string
  - name: arg2
    dtype: string
  - name: support
    dtype: int64
  - name: facets
    list:
    - name: value
      dtype: string
    - name: type
      dtype: string
    - name: support
      dtype: int64
  - name: source_sentences
    list:
    - name: text
      dtype: string
    - name: source
      dtype: string
  splits:
  - name: train
    num_bytes: 2976665740
    num_examples: 8904060
  download_size: 898478552
  dataset_size: 2976665740
- config_name: open
  features:
  - name: subject
    dtype: string
  - name: predicate
    dtype: string
  - name: object
    dtype: string
  - name: support
    dtype: int64
  - name: facets
    list:
    - name: value
      dtype: string
    - name: type
      dtype: string
    - name: support
      dtype: int64
  - name: source_sentences
    list:
    - name: text
      dtype: string
    - name: source
      dtype: string
  splits:
  - name: train
    num_bytes: 2882646222
    num_examples: 8904060
  download_size: 900156754
  dataset_size: 2882646222
configs:
- config_name: canonical
  data_files:
  - split: train
    path: canonical/train-*
  default: true
- config_name: open
  data_files:
  - split: train
    path: open/train-*
---

# Dataset Card for Ascent KB

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://ascent.mpi-inf.mpg.de/
- **Repository:** https://github.com/phongnt570/ascent
- **Paper:** https://arxiv.org/abs/2011.00905
- **Point of Contact:** http://tuan-phong.com

### Dataset Summary

This dataset contains 8.9M commonsense assertions extracted by the Ascent pipeline developed at the [Max Planck Institute for Informatics](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/). The focus of this dataset is on everyday concepts such as *elephant*, *car*, *laptop*, etc. The current version of Ascent KB (v1.0.0) is approximately **19 times larger than ConceptNet** (note that, in this comparison, non-commonsense knowledge in ConceptNet such as lexical relations is excluded).

For more details, take a look at [the research paper](https://arxiv.org/abs/2011.00905) and [the website](https://ascent.mpi-inf.mpg.de).

### Supported Tasks and Leaderboards

The dataset can be used in a wide range of downstream tasks such as commonsense question answering or dialogue systems.

### Languages

The dataset is in English.

## Dataset Structure

### Data Instances

There are two configurations available for this dataset:

1. `canonical` (default): This part contains `<arg1 ; rel ; arg2>` assertions where the relations (`rel`) were mapped to [ConceptNet relations](https://github.com/commonsense/conceptnet5/wiki/Relations) with slight modifications:
   - Introducing 2 new relations: `/r/HasSubgroup`, `/r/HasAspect`.
   - All `/r/HasA` relations were replaced with `/r/HasAspect`. This is motivated by the [ATOMIC-2020](https://allenai.org/data/atomic-2020) schema, although they grouped all `/r/HasA` and `/r/HasProperty` into `/r/HasProperty`.
   - The `/r/UsedFor` relation was replaced with `/r/ObjectUse`, which is broader (could be either _"used for"_, _"used in"_, or _"used as"_, etc.). This is also taken from ATOMIC-2020.
2. `open`: This part contains open assertions of the form `<subject ; predicate ; object>` extracted directly from web contents. This is the original form of the `canonical` triples.

In both configurations, each assertion is equipped with extra information including: a set of semantic `facets` (e.g., *LOCATION*, *TEMPORAL*, etc.), its `support` (i.e., number of occurrences), and a list of `source_sentences`.

An example row in the `canonical` configuration:

```JSON
{
  "arg1": "elephant",
  "rel": "/r/HasProperty",
  "arg2": "intelligent",
  "support": 15,
  "facets": [
    {
      "value": "extremely",
      "type": "DEGREE",
      "support": 11
    }
  ],
  "source_sentences": [
    {
      "text": "Elephants are extremely intelligent animals.",
      "source": "https://www.softschools.com/facts/animals/asian_elephant_facts/2310/"
    },
    {
      "text": "Elephants are extremely intelligent creatures and an elephant's brain can weigh as much as 4-6 kg.",
      "source": "https://www.elephantsforafrica.org/elephant-facts/"
    }
  ]
}
```

### Data Fields

- **For `canonical` configuration**
  - `arg1`: the first argument to the relationship, e.g., *elephant*
  - `rel`: the canonical relation, e.g., */r/HasProperty*
  - `arg2`: the second argument to the relationship, e.g., *intelligent*
  - `support`: the number of occurrences of the assertion, e.g., *15*
  - `facets`: an array of semantic facets, each containing
    - `value`: facet value, e.g., *extremely*
    - `type`: facet type, e.g., *DEGREE*
    - `support`: the number of occurrences of the facet, e.g., *11*
  - `source_sentences`: an array of source sentences from which the assertion was extracted, each containing
    - `text`: the raw text of the sentence
    - `source`: the URL to its parent document
- **For `open` configuration**
  - The fields of this configuration are the same as the `canonical` configuration's, except that the (`arg1`, `rel`, `arg2`) fields are replaced with the (`subject`, `predicate`, `object`) fields, which are free-text phrases extracted directly from the source sentences using an Open Information Extraction (OpenIE) tool.

### Data Splits

There are no splits. All data points belong to a single default split called `train`.

## Dataset Creation

### Curation Rationale

The commonsense knowledge base was created to assist in the development of robust and reliable AI.

### Source Data

#### Initial Data Collection and Normalization

Texts were collected from the web using the Bing Search API, and went through various cleaning steps before being processed by an OpenIE tool to get open assertions. The assertions were then grouped into semantically equivalent clusters. Take a look at the research paper for more details.

#### Who are the source language producers?

Web users.

### Annotations

#### Annotation process

None.

#### Who are the annotators?

None.

### Personal and Sensitive Information

Unknown.

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

The knowledge base has been developed by researchers at the [Max Planck Institute for Informatics](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/).

Contact [Tuan-Phong Nguyen](http://tuan-phong.com) in case of questions and comments.

### Licensing Information

[The Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)

### Citation Information

```
@InProceedings{nguyen2021www,
  title={Advanced Semantics for Commonsense Knowledge Extraction},
  author={Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},
  year={2021},
  booktitle={The Web Conference 2021},
}
```

### Contributions

Thanks to [@phongnt570](https://github.com/phongnt570) for adding this dataset.
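
The row layout described in the card (a triple plus `support`, `facets`, and `source_sentences`) can be handled as a plain Python dictionary. The following is a minimal sketch, not part of the Ascent tooling: the helper names `render_assertion` and `has_min_support` are hypothetical, and the row is the card's own example with only its first source sentence kept.

```python
from typing import Any

# Example row copied from the dataset card (canonical configuration),
# shortened to a single source sentence.
row: dict[str, Any] = {
    "arg1": "elephant",
    "rel": "/r/HasProperty",
    "arg2": "intelligent",
    "support": 15,
    "facets": [{"value": "extremely", "type": "DEGREE", "support": 11}],
    "source_sentences": [
        {
            "text": "Elephants are extremely intelligent animals.",
            "source": "https://www.softschools.com/facts/animals/asian_elephant_facts/2310/",
        },
    ],
}


def render_assertion(row: dict[str, Any]) -> str:
    """Format a canonical row as '<arg1 ; rel ; arg2> [TYPE=value, ...] (support=N)'."""
    facets = ", ".join(f"{facet['type']}={facet['value']}" for facet in row["facets"])
    triple = f"<{row['arg1']} ; {row['rel']} ; {row['arg2']}>"
    suffix = f" [{facets}]" if facets else ""
    return f"{triple}{suffix} (support={row['support']})"


def has_min_support(row: dict[str, Any], threshold: int) -> bool:
    """Return True when the assertion was observed at least `threshold` times."""
    return row["support"] >= threshold


print(render_assertion(row))
# <elephant ; /r/HasProperty ; intelligent> [DEGREE=extremely] (support=15)
```

Filtering on `support` is one plausible way to keep only frequently observed assertions; a suitable threshold depends on the downstream task and is not specified by the card.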
Provider:
tuanphong
Summary of Original Information

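Dataset Overview

Dataset name: Ascent KB

Dataset description: Ascent KB is a dataset of 8.9 million commonsense assertions extracted by the Ascent pipeline developed at the Max Planck Institute for Informatics. It focuses on everyday concepts such as elephant, car, and laptop.

Dataset size: between 1M and 10M rows (8,904,060 rows in the train split of each configuration).

Language: English

License: CC-BY-4.0

Multilinguality: monolingual

Task category: other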

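Dataset Structure

Data Instances

  • Configuration 1 (default): canonical, containing <arg1 ; rel ; arg2> assertions whose relation rel is mapped to ConceptNet relations, with slight modifications.
  • Configuration 2: open, containing open assertions of the form <subject ; predicate ; object>, extracted directly from web content.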
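
Selecting one of the two configurations might look like the sketch below. It assumes the Hugging Face `datasets` package is installed and that the repository ID matches the download link on this page (`tuanphong/ascent_kb`). The helper names `check_config` and `stream_rows` are hypothetical.

```python
from typing import Any, Iterator

# Configuration names as listed in the dataset card.
CONFIGS = ("canonical", "open")
# Repository ID taken from the download link on this page (an assumption
# that the same ID resolves on the Hugging Face Hub).
REPO_ID = "tuanphong/ascent_kb"


def check_config(name: str) -> str:
    """Return `name` unchanged if it is a known configuration, else raise ValueError."""
    if name not in CONFIGS:
        raise ValueError(f"unknown configuration {name!r}; expected one of {CONFIGS}")
    return name


def stream_rows(config: str = "canonical") -> Iterator[dict[str, Any]]:
    """Lazily iterate over the rows of the single `train` split.

    Requires the `datasets` package and network access to the Hub.
    """
    # Imported lazily so that check_config stays usable without `datasets`.
    from datasets import load_dataset

    dataset = load_dataset(REPO_ID, check_config(config), split="train", streaming=True)
    yield from dataset
```

Calling `next(stream_rows("open"))` would fetch the first row of the `open` configuration. Streaming is used here because the card lists a download size of roughly 900 MB per configuration; dropping `streaming=True` downloads the full split instead.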

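Data Fields

  • canonical configuration:

    • arg1: the first argument of the relation
    • rel: the canonical relation
    • arg2: the second argument of the relation
    • support: the number of occurrences of the assertion
    • facets: semantic facets, each with value, type, and support
    • source_sentences: the list of source sentences from which the assertion was extracted, each with text and source
  • open configuration:

    • Same as the canonical configuration, except that the arg1, rel, and arg2 fields are replaced with subject, predicate, and object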
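
Because the two configurations differ only in the names of the triple fields, code written for one can be reused for the other by renaming keys. The sketch below is illustrative only: `as_canonical_keys` is a hypothetical helper, and the sample row is made up rather than copied from the dataset.

```python
from typing import Any

# Field-name correspondence between the two configurations, per the dataset card.
OPEN_TO_CANONICAL = {"subject": "arg1", "predicate": "rel", "object": "arg2"}


def as_canonical_keys(open_row: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of an `open` row with its triple fields renamed to arg1/rel/arg2.

    Only the keys change: the free-text predicate is NOT mapped to a
    ConceptNet-style relation such as /r/HasProperty.
    """
    return {OPEN_TO_CANONICAL.get(key, key): value for key, value in open_row.items()}


# Made-up row for illustration only; it is not copied from the dataset.
example_open_row = {
    "subject": "elephant",
    "predicate": "be",
    "object": "intelligent",
    "support": 1,
    "facets": [],
    "source_sentences": [],
}

print(as_canonical_keys(example_open_row))
```

The renamed row still carries a free-text predicate in `rel`, so it should not be mixed with genuine `canonical` rows without an explicit relation-mapping step.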

Data Splits

  • All data points belong to the default split train; there are no other splits.

Dataset Creation

Source Data

  • Texts were collected from the web via the Bing Search API, cleaned, and then processed by an OpenIE tool.

License Information

Citation Information

@InProceedings{nguyen2021www,
  title={Advanced Semantics for Commonsense Knowledge Extraction},
  author={Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},
  year={2021},
  booktitle={The Web Conference 2021},
}

The content above was collected and summarized by 遇见数据集.