
midas/kdd

Hugging Face, updated 2022-03-05, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/midas/kdd
Resource description:
## Dataset Summary

A dataset for benchmarking keyphrase extraction and generation techniques on abstracts of English scientific papers. For more details about the dataset, please refer to the original paper: [https://aclanthology.org/D14-1150.pdf](https://aclanthology.org/D14-1150.pdf)

Original source of the data - []()

## Dataset Structure

### Data Fields

- **id**: unique identifier of the document.
- **document**: whitespace-separated list of words in the document.
- **doc_bio_tags**: BIO tags for each word in the document. B marks the beginning of a keyphrase, I marks a word inside a keyphrase, and O marks a word that is not part of any keyphrase.
- **extractive_keyphrases**: list of all the present keyphrases.
- **abstractive_keyphrases**: list of all the absent keyphrases.

### Data Splits

| Split | #datapoints |
|--|--|
| Test | 755 |

- Percentage of keyphrases that are named entities: 56.99% (named entities detected using scispacy - en-core-sci-lg model)
- Percentage of keyphrases that are noun phrases: 54.99% (noun phrases detected using spacy en-core-web-lg after removing determiners)

## Usage

### Full Dataset

```python
from datasets import load_dataset

# get entire dataset
dataset = load_dataset("midas/kdd", "raw")

# sample from the test split
print("Sample from test data split")
test_sample = dataset["test"][0]
print("Fields in the sample: ", list(test_sample.keys()))
print("Tokenized Document: ", test_sample["document"])
print("Document BIO Tags: ", test_sample["doc_bio_tags"])
print("Extractive/present Keyphrases: ", test_sample["extractive_keyphrases"])
print("Abstractive/absent Keyphrases: ", test_sample["abstractive_keyphrases"])
print("\n-----------\n")
```

**Output**

```bash
Sample from test data split
Fields in the sample: ['id', 'document', 'doc_bio_tags', 'extractive_keyphrases', 'abstractive_keyphrases', 'other_metadata']
Tokenized Document: ['Discovering', 'roll-up', 'dependencies']
Document BIO Tags: ['O', 'O', 'O']
Extractive/present Keyphrases: []
Abstractive/absent Keyphrases: ['logical design']

-----------
```

### Keyphrase Extraction

```python
from datasets import load_dataset

# get the dataset only for keyphrase extraction
dataset = load_dataset("midas/kdd", "extraction")

print("Samples for Keyphrase Extraction")

# sample from the test split
print("Sample from test data split")
test_sample = dataset["test"][0]
print("Fields in the sample: ", list(test_sample.keys()))
print("Tokenized Document: ", test_sample["document"])
print("Document BIO Tags: ", test_sample["doc_bio_tags"])
print("\n-----------\n")
```

### Keyphrase Generation

```python
from datasets import load_dataset

# get the dataset only for keyphrase generation
dataset = load_dataset("midas/kdd", "generation")

print("Samples for Keyphrase Generation")

# sample from the test split
print("Sample from test data split")
test_sample = dataset["test"][0]
print("Fields in the sample: ", list(test_sample.keys()))
print("Tokenized Document: ", test_sample["document"])
print("Extractive/present Keyphrases: ", test_sample["extractive_keyphrases"])
print("Abstractive/absent Keyphrases: ", test_sample["abstractive_keyphrases"])
print("\n-----------\n")
```

## Citation Information

```
@inproceedings{caragea-etal-2014-citation,
    title = "Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach",
    author = "Caragea, Cornelia and Bulgarov, Florin Adrian and Godea, Andreea and Das Gollapalli, Sujatha",
    booktitle = "Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ({EMNLP})",
    month = oct,
    year = "2014",
    address = "Doha, Qatar",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D14-1150",
    doi = "10.3115/v1/D14-1150",
    pages = "1435--1446",
}
```

## Contributions

Thanks to [@debanjanbhucs](https://github.com/debanjanbhucs), [@dibyaaaaax](https://github.com/dibyaaaaax) and [@ad6398](https://github.com/ad6398) for adding this dataset.
Provider:
midas
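Summary of Original Information

Dataset Overview

This dataset is for benchmarking keyphrase extraction and generation techniques on abstracts of English scientific papers. For details, see the original paper: https://aclanthology.org/D14-1150.pdf

Dataset Structure

Data Fields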

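  • id: unique identifier of the document.
  • document: whitespace-separated list of words in the document.
  • doc_bio_tags: BIO tags for each word in the document. B marks the beginning of a keyphrase, I marks a word inside a keyphrase, and O marks a word that is not part of any keyphrase.
  • extractive_keyphrases: list of all present keyphrases.
  • abstractive_keyphrases: list of all absent keyphrases.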
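
The relationship between `document` and `doc_bio_tags` can be illustrated with a small decoder. This is a minimal sketch, assuming the tags are the plain strings `"B"`, `"I"`, and `"O"`; the helper name `bio_to_keyphrases` and the toy tokens are hypothetical and not part of the dataset or its loader.

```python
from typing import List


def bio_to_keyphrases(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect the token spans marked B/I into keyphrase strings."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    phrases: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new keyphrase starts; flush any phrase that is still open.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the keyphrase opened by the preceding "B".
            current.append(token)
        else:
            # "O", or a stray "I" with no open phrase: close what is open.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Toy example (invented, not taken from the dataset).
tokens = ["We", "study", "keyphrase", "extraction", "with", "neural", "networks"]
tags = ["O", "O", "B", "I", "O", "B", "I"]
print(bio_to_keyphrases(tokens, tags))
```

On a real record, the phrases recovered this way would be expected to correspond to the present keyphrases, although the card does not state whether casing or duplicates are normalized.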

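Data Splits

| Split | #datapoints |
|--|--|
| Test | 755 |

  • Percentage of keyphrases that are named entities: 56.99% (detected using the scispacy en-core-sci-lg model)
  • Percentage of keyphrases that are noun phrases: 54.99% (detected using the spacy en-core-web-lg model, after removing determiners)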

Usage

Full Dataset

```python
from datasets import load_dataset

# get the entire dataset
dataset = load_dataset("midas/kdd", "raw")

# sample from the test split
print("Sample from test data split")
test_sample = dataset["test"][0]
print("Fields in the sample: ", list(test_sample.keys()))
print("Tokenized Document: ", test_sample["document"])
print("Document BIO Tags: ", test_sample["doc_bio_tags"])
print("Extractive/present Keyphrases: ", test_sample["extractive_keyphrases"])
print("Abstractive/absent Keyphrases: ", test_sample["abstractive_keyphrases"])
print("\n-----------\n")
```
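
After loading the raw configuration, a quick tally of present and absent keyphrases can help sanity-check the data. The sketch below assumes each record is a dict with the field names listed on the card; the function `keyphrase_stats` is hypothetical, and the toy records stand in for real rows (the first mirrors the card's sample output, the second is invented).

```python
from typing import Dict, Iterable, List


def keyphrase_stats(records: Iterable[Dict[str, List[str]]]) -> Dict[str, float]:
    """Tally documents, tokens, and present/absent keyphrases."""
    n_docs = n_tokens = n_present = n_absent = 0
    for record in records:
        n_docs += 1
        n_tokens += len(record["document"])
        n_present += len(record["extractive_keyphrases"])
        n_absent += len(record["abstractive_keyphrases"])
    total = n_present + n_absent
    return {
        "documents": n_docs,
        "avg_tokens": n_tokens / n_docs if n_docs else 0.0,
        "present": n_present,
        "absent": n_absent,
        # Share of all keyphrases that do not appear in the document text.
        "absent_share": n_absent / total if total else 0.0,
    }


# Toy records that reuse the card's field names (not real dataset rows).
toy_records = [
    {
        "document": ["Discovering", "roll-up", "dependencies"],
        "extractive_keyphrases": [],
        "abstractive_keyphrases": ["logical design"],
    },
    {
        "document": ["graph", "mining", "at", "scale"],
        "extractive_keyphrases": ["graph mining"],
        "abstractive_keyphrases": [],
    },
]
print(keyphrase_stats(toy_records))
```

Since iterating over a loaded split yields one dict per record, passing `dataset["test"]` in place of `toy_records` should work the same way.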

Keyphrase Extraction

```python
from datasets import load_dataset

# get the dataset only for keyphrase extraction
dataset = load_dataset("midas/kdd", "extraction")

print("Samples for Keyphrase Extraction")

# sample from the test split
print("Sample from test data split")
test_sample = dataset["test"][0]
print("Fields in the sample: ", list(test_sample.keys()))
print("Tokenized Document: ", test_sample["document"])
print("Document BIO Tags: ", test_sample["doc_bio_tags"])
print("\n-----------\n")
```
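
For token-classification models, the string tags in `doc_bio_tags` usually need to become integer label ids. The following is a minimal sketch of that step; the mapping `LABEL2ID` and the helper names are assumptions chosen for illustration, since the card does not prescribe any particular id assignment.

```python
from typing import Dict, List

# Hypothetical label mapping; any consistent assignment would do.
LABEL2ID: Dict[str, int] = {"O": 0, "B": 1, "I": 2}
ID2LABEL: Dict[int, str] = {idx: label for label, idx in LABEL2ID.items()}


def encode_tags(tags: List[str]) -> List[int]:
    """Map BIO tag strings to integer ids (raises KeyError on unknown tags)."""
    return [LABEL2ID[tag] for tag in tags]


def decode_tags(ids: List[int]) -> List[str]:
    """Map integer ids back to BIO tag strings."""
    return [ID2LABEL[idx] for idx in ids]


# Toy tag sequence (invented, not taken from the dataset).
tags = ["O", "B", "I", "O"]
ids = encode_tags(tags)
print(ids)
print(decode_tags(ids) == tags)
```

Keeping both directions of the mapping makes it straightforward to turn model predictions back into tags for evaluation.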

Keyphrase Generation

```python
from datasets import load_dataset

# get the dataset only for keyphrase generation
dataset = load_dataset("midas/kdd", "generation")

print("Samples for Keyphrase Generation")

# sample from the test split
print("Sample from test data split")
test_sample = dataset["test"][0]
print("Fields in the sample: ", list(test_sample.keys()))
print("Tokenized Document: ", test_sample["document"])
print("Extractive/present Keyphrases: ", test_sample["extractive_keyphrases"])
print("Abstractive/absent Keyphrases: ", test_sample["abstractive_keyphrases"])
print("\n-----------\n")
```
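
One common way to use the generation fields is to flatten each record into a source/target text pair for a sequence-to-sequence model. The sketch below is one possible formulation, not something the card specifies: the helper `to_seq2seq_pair`, the `" ; "` separator, and the choice to list present keyphrases before absent ones are all assumptions.

```python
from typing import Dict, List, Tuple

SEPARATOR = " ; "  # assumed delimiter between keyphrases in the target string


def to_seq2seq_pair(
    record: Dict[str, List[str]], separator: str = SEPARATOR
) -> Tuple[str, str]:
    """Build (source text, target keyphrase string) from one record."""
    source = " ".join(record["document"])
    # Present keyphrases first, then absent ones (an arbitrary ordering).
    keyphrases = record["extractive_keyphrases"] + record["abstractive_keyphrases"]
    target = separator.join(keyphrases)
    return source, target


# Toy record that reuses the card's field names (values are illustrative).
toy_record = {
    "document": ["Discovering", "roll-up", "dependencies"],
    "extractive_keyphrases": [],
    "abstractive_keyphrases": ["logical design"],
}
source, target = to_seq2seq_pair(toy_record)
print(source)
print(target)
```

A different separator or ordering may suit a given model better; whichever is chosen has to be applied consistently at training and evaluation time.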

Citation Information

```
@inproceedings{caragea-etal-2014-citation,
    title = "Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach",
    author = "Caragea, Cornelia and Bulgarov, Florin Adrian and Godea, Andreea and Das Gollapalli, Sujatha",
    booktitle = "Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ({EMNLP})",
    month = oct,
    year = "2014",
    address = "Doha, Qatar",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D14-1150",
    doi = "10.3115/v1/D14-1150",
    pages = "1435--1446",
}
```
