midas/kdd

Name: midas/kdd
Creator: midas
Published: 2022-03-05 04:06:21
License: No description available

Hugging Face | Updated: 2022-03-05 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/midas/kdd


Resource description:

## Dataset Summary

A dataset for benchmarking keyphrase extraction and generation techniques on abstracts of English scientific papers. For more details about the dataset, please refer to the original paper: [https://aclanthology.org/D14-1150.pdf](https://aclanthology.org/D14-1150.pdf)

Original source of the data - []()

## Dataset Structure

### Data Fields

- **id**: unique identifier of the document.
- **document**: whitespace-separated list of words in the document.
- **doc_bio_tags**: BIO tags for each word in the document. B marks the beginning of a keyphrase, I marks a word inside a keyphrase, and O marks a word that is not part of any keyphrase.
- **extractive_keyphrases**: list of all the present keyphrases (those that appear in the document).
- **abstractive_keyphrases**: list of all the absent keyphrases (those that do not appear in the document).

### Data Splits

| Split | #datapoints |
|--|--|
| Test | 755 |

- Percentage of keyphrases that are named entities: 56.99% (named entities detected using scispacy - en-core-sci-lg model)
- Percentage of keyphrases that are noun phrases: 54.99% (noun phrases detected using spacy en-core-web-lg after removing determiners)

## Usage

### Full Dataset

```python
from datasets import load_dataset

# get entire dataset
dataset = load_dataset("midas/kdd", "raw")

# sample from the test split
print("Sample from test data split")
test_sample = dataset["test"][0]
print("Fields in the sample: ", [key for key in test_sample.keys()])
print("Tokenized Document: ", test_sample["document"])
print("Document BIO Tags: ", test_sample["doc_bio_tags"])
print("Extractive/present Keyphrases: ", test_sample["extractive_keyphrases"])
print("Abstractive/absent Keyphrases: ", test_sample["abstractive_keyphrases"])
print("\n-----------\n")
```

**Output**

```bash
Sample from test data split
Fields in the sample: ['id', 'document', 'doc_bio_tags', 'extractive_keyphrases', 'abstractive_keyphrases', 'other_metadata']
Tokenized Document: ['Discovering', 'roll-up', 'dependencies']
Document BIO Tags: ['O', 'O', 'O']
Extractive/present Keyphrases: []
Abstractive/absent Keyphrases: ['logical design']

-----------
```

### Keyphrase Extraction

```python
from datasets import load_dataset

# get the dataset only for keyphrase extraction
dataset = load_dataset("midas/kdd", "extraction")

print("Samples for Keyphrase Extraction")

# sample from the test split
print("Sample from test data split")
test_sample = dataset["test"][0]
print("Fields in the sample: ", [key for key in test_sample.keys()])
print("Tokenized Document: ", test_sample["document"])
print("Document BIO Tags: ", test_sample["doc_bio_tags"])
print("\n-----------\n")
```

### Keyphrase Generation

```python
from datasets import load_dataset

# get the dataset only for keyphrase generation
dataset = load_dataset("midas/kdd", "generation")

print("Samples for Keyphrase Generation")

# sample from the test split
print("Sample from test data split")
test_sample = dataset["test"][0]
print("Fields in the sample: ", [key for key in test_sample.keys()])
print("Tokenized Document: ", test_sample["document"])
print("Extractive/present Keyphrases: ", test_sample["extractive_keyphrases"])
print("Abstractive/absent Keyphrases: ", test_sample["abstractive_keyphrases"])
print("\n-----------\n")
```

## Citation Information

```
@inproceedings{caragea-etal-2014-citation,
    title = "Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach",
    author = "Caragea, Cornelia and Bulgarov, Florin Adrian and Godea, Andreea and Das Gollapalli, Sujatha",
    booktitle = "Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ({EMNLP})",
    month = oct,
    year = "2014",
    address = "Doha, Qatar",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D14-1150",
    doi = "10.3115/v1/D14-1150",
    pages = "1435--1446",
}
```

## Contributions

Thanks to [@debanjanbhucs](https://github.com/debanjanbhucs), [@dibyaaaaax](https://github.com/dibyaaaaax) and [@ad6398](https://github.com/ad6398) for adding this dataset.

Provider:

midas

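Summary of Original Information

Dataset Overview

This dataset is used to evaluate keyphrase extraction and generation techniques on abstracts of English scientific papers. For details, see the original paper: https://aclanthology.org/D14-1150.pdf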

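Dataset Structure

Data Fields

id: unique identifier of the document.
document: whitespace-separated list of the words in the document.
doc_bio_tags: BIO tag for each word in the document. B marks the beginning of a keyphrase, I marks a word inside a keyphrase, and O marks a word that is not part of any keyphrase.
extractive_keyphrases: list of the present keyphrases (those that appear in the document).
abstractive_keyphrases: list of the absent keyphrases (those that do not appear in the document).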
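
The BIO scheme used by doc_bio_tags can be decoded back into phrases by collecting each B token together with the I tokens that follow it. The block below is a minimal sketch of that decoding, not the dataset's own tooling: the function name `bio_to_keyphrases` and the example tokens are hypothetical, and it assumes the bare tags "B", "I" and "O" described above.

```python
from typing import List


def bio_to_keyphrases(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect the token spans marked by B/I tags into keyphrase strings."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    phrases: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new keyphrase starts; flush any phrase still in progress.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the phrase opened by the preceding B tag.
            current.append(token)
        else:
            # "O" (or a stray "I" with no open phrase) closes the current phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hypothetical example, not taken from the dataset.
tokens = ["We", "study", "keyphrase", "extraction", "with", "citation", "networks"]
tags = ["O", "O", "B", "I", "O", "B", "I"]
print(bio_to_keyphrases(tokens, tags))  # ['keyphrase extraction', 'citation networks']
```

Applied to a real sample, the same call would take `test_sample["document"]` and `test_sample["doc_bio_tags"]` as its two arguments.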

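Data Splits

| Split | Number of data points |
|--|--|
| Test | 755 |

Percentage of keyphrases that are named entities: 56.99% (detected using the scispacy en-core-sci-lg model)
Percentage of keyphrases that are noun phrases: 54.99% (detected using the spacy en-core-web-lg model after removing determiners)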

Usage

Full Dataset

```python
from datasets import load_dataset

# Get the entire dataset
dataset = load_dataset("midas/kdd", "raw")

# Sample from the test split
print("Sample from test data split")
test_sample = dataset["test"][0]
print("Fields in the sample: ", [key for key in test_sample.keys()])
print("Tokenized Document: ", test_sample["document"])
print("Document BIO Tags: ", test_sample["doc_bio_tags"])
print("Extractive/present Keyphrases: ", test_sample["extractive_keyphrases"])
print("Abstractive/absent Keyphrases: ", test_sample["abstractive_keyphrases"])
print("\n-----------\n")
```

Keyphrase Extraction

```python
from datasets import load_dataset

# Get the dataset for keyphrase extraction only
dataset = load_dataset("midas/kdd", "extraction")

print("Samples for Keyphrase Extraction")

# Sample from the test split
print("Sample from test data split")
test_sample = dataset["test"][0]
print("Fields in the sample: ", [key for key in test_sample.keys()])
print("Tokenized Document: ", test_sample["document"])
print("Document BIO Tags: ", test_sample["doc_bio_tags"])
print("\n-----------\n")
```
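
For extraction, each token of the document is paired with a BIO tag. To illustrate how such tags relate to a list of keyphrases, here is a minimal sketch that tags every exact, case-insensitive occurrence of a phrase. The function `keyphrases_to_bio` and its matching rule are assumptions made for illustration; the dataset's own preprocessing may differ (for example, it might match after stemming).

```python
from typing import List


def keyphrases_to_bio(tokens: List[str], keyphrases: List[str]) -> List[str]:
    """Tag each token as B, I or O given the phrases to mark (assumed exact match)."""
    tags = ["O"] * len(tokens)
    lowered = [token.lower() for token in tokens]
    for phrase in keyphrases:
        words = phrase.lower().split()
        n = len(words)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            # Only tag windows that match and do not overlap an earlier phrase.
            window_free = all(tag == "O" for tag in tags[start:start + n])
            if window_free and lowered[start:start + n] == words:
                tags[start] = "B"
                for i in range(start + 1, start + n):
                    tags[i] = "I"
    return tags


# Hypothetical example, not taken from the dataset.
tokens = ["Mining", "frequent", "patterns", "in", "data", "streams"]
print(keyphrases_to_bio(tokens, ["frequent patterns", "data streams"]))
# ['O', 'B', 'I', 'O', 'B', 'I']
```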

Keyphrase Generation

```python
from datasets import load_dataset

# Get the dataset for keyphrase generation only
dataset = load_dataset("midas/kdd", "generation")

print("Samples for Keyphrase Generation")

# Sample from the test split
print("Sample from test data split")
test_sample = dataset["test"][0]
print("Fields in the sample: ", [key for key in test_sample.keys()])
print("Tokenized Document: ", test_sample["document"])
print("Extractive/present Keyphrases: ", test_sample["extractive_keyphrases"])
print("Abstractive/absent Keyphrases: ", test_sample["abstractive_keyphrases"])
print("\n-----------\n")
```
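
The generation fields separate present (extractive) keyphrases from absent (abstractive) ones. As a rough sketch of that distinction, the block below sorts a list of phrases by whether they occur as a contiguous token sequence in the document. The helper names and the exact, case-insensitive matching rule are assumptions for illustration only; the dataset authors may have used a different criterion, such as stemmed matching.

```python
from typing import List, Tuple


def contains_phrase(tokens: List[str], phrase: str) -> bool:
    """Return True if the phrase occurs as a contiguous token sequence."""
    words = phrase.lower().split()
    lowered = [token.lower() for token in tokens]
    n = len(words)
    return n > 0 and any(
        lowered[i:i + n] == words for i in range(len(lowered) - n + 1)
    )


def split_keyphrases(
    tokens: List[str], keyphrases: List[str]
) -> Tuple[List[str], List[str]]:
    """Split keyphrases into (present, absent) relative to the document tokens."""
    present: List[str] = []
    absent: List[str] = []
    for phrase in keyphrases:
        if contains_phrase(tokens, phrase):
            present.append(phrase)
        else:
            absent.append(phrase)
    return present, absent


# Hypothetical example, not taken from the dataset.
tokens = ["Clustering", "large", "graphs", "with", "random", "walks"]
present, absent = split_keyphrases(tokens, ["random walks", "graph mining"])
print(present)  # ['random walks']
print(absent)   # ['graph mining']
```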

Citation Information

```
@inproceedings{caragea-etal-2014-citation,
    title = "Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach",
    author = "Caragea, Cornelia and Bulgarov, Florin Adrian and Godea, Andreea and Das Gollapalli, Sujatha",
    booktitle = "Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ({EMNLP})",
    month = oct,
    year = "2014",
    address = "Doha, Qatar",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D14-1150",
    doi = "10.3115/v1/D14-1150",
    pages = "1435--1446",
}
```
