midas/kdd
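Dataset Overview
This dataset is a benchmark for evaluating keyphrase extraction and generation techniques on abstracts of English scientific papers. For details, see the original paper: https://aclanthology.org/D14-1150.pdf.
Dataset Structure
Data Fields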
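- id: unique identifier of the document.
- document: the document's tokens as a list, obtained by splitting on whitespace.
- doc_bio_tags: BIO tag for each token in the document. B marks the beginning of a keyphrase, I marks a token inside a keyphrase, and O marks a token outside any keyphrase.
- extractive_keyphrases: list of keyphrases that are present in the document.
- abstractive_keyphrases: list of keyphrases that are absent from the document.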
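To make the relationship between `document` and `doc_bio_tags` concrete, here is a minimal sketch that reads keyphrases back out of a tagged token list. It assumes the tags are stored as the plain strings "B", "I", and "O"; the helper name `bio_to_keyphrases` and the sample tokens are hypothetical and not taken from the dataset.

```python
from typing import List


def bio_to_keyphrases(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect the token spans marked by B/I tags into keyphrase strings."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    phrases: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new keyphrase starts; flush the one in progress, if any.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the keyphrase that is currently open.
            current.append(token)
        else:
            # "O" (or a stray "I" with no open phrase) closes any open span.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hypothetical example, not a record from midas/kdd.
tokens = ["graph", "mining", "for", "large", "social", "networks"]
tags = ["B", "I", "O", "O", "B", "I"]
print(bio_to_keyphrases(tokens, tags))  # ['graph mining', 'social networks']
```

Applied to a real record, the spans recovered this way would be expected to correspond to the entries of `extractive_keyphrases`, although duplicates and casing may differ.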
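Data Splits

| Split | Number of data points |
|---|---|
| Test | 755 |

- Percentage of keyphrases that are named entities: 56.99% (detected with the scispacy en-core-sci-lg model)
- Percentage of keyphrases that are noun phrases: 54.99% (detected with the spacy en-core-web-lg model, after removing determiners)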
Usage
Full Dataset

    from datasets import load_dataset

    # Load the entire dataset
    dataset = load_dataset("midas/kdd", "raw")

    # Sample from the test split
    print("Sample from test dataset split")
    test_sample = dataset["test"][0]
    print("Fields in the sample: ", list(test_sample.keys()))
    print("Tokenized Document: ", test_sample["document"])
    print("Document BIO Tags: ", test_sample["doc_bio_tags"])
    print("Extractive/present Keyphrases: ", test_sample["extractive_keyphrases"])
    print("Abstractive/absent Keyphrases: ", test_sample["abstractive_keyphrases"])
    print("\n")

Keyphrase Extraction

    from datasets import load_dataset

    # Load only the data needed for keyphrase extraction
    dataset = load_dataset("midas/kdd", "extraction")

    print("Samples for Keyphrase Extraction")

    # Sample from the test split
    print("Sample from test data split")
    test_sample = dataset["test"][0]
    print("Fields in the sample: ", list(test_sample.keys()))
    print("Tokenized Document: ", test_sample["document"])
    print("Document BIO Tags: ", test_sample["doc_bio_tags"])
    print("\n")

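Since the dataset is meant for evaluation, a scoring step usually follows extraction. This card does not specify a metric, so the sketch below is only one plausible choice: exact-match precision, recall, and F1 over sets of lowercased keyphrases. The function name `exact_match_prf`, the normalization, and the sample phrases are all assumptions; some evaluations also stem phrases before matching, which is not done here.

```python
from typing import Iterable, Tuple


def exact_match_prf(
    predicted: Iterable[str], gold: Iterable[str]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of keyphrases."""
    # Assumed normalization: strip whitespace and lowercase; no stemming.
    pred_set = {p.strip().lower() for p in predicted if p.strip()}
    gold_set = {g.strip().lower() for g in gold if g.strip()}

    matched = len(pred_set & gold_set)
    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical predictions and gold keyphrases for a single document.
predicted = ["graph mining", "social networks", "clustering"]
gold = ["Graph Mining", "community detection"]
precision, recall, f1 = exact_match_prf(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")  # P=0.333 R=0.500 F1=0.400
```

For the extraction task, `gold` could be the keyphrases decoded from `doc_bio_tags`, and scores would typically be averaged over the 755 test documents.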
Keyphrase Generation

    from datasets import load_dataset

    # Load only the data needed for keyphrase generation
    dataset = load_dataset("midas/kdd", "generation")

    print("Samples for Keyphrase Generation")

    # Sample from the test split
    print("Sample from test data split")
    test_sample = dataset["test"][0]
    print("Fields in the sample: ", list(test_sample.keys()))
    print("Tokenized Document: ", test_sample["document"])
    print("Extractive/present Keyphrases: ", test_sample["extractive_keyphrases"])
    print("Abstractive/absent Keyphrases: ", test_sample["abstractive_keyphrases"])
    print("\n")

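The present/absent distinction can be illustrated with a small check of whether a keyphrase occurs as a contiguous token sequence in the document. This is a sketch under assumptions: it uses case-insensitive exact token matching, which may not be the rule the dataset authors applied, and the helper names and sample data are hypothetical.

```python
from typing import List, Tuple


def contains_phrase(doc_tokens: List[str], phrase_tokens: List[str]) -> bool:
    """Return True if phrase_tokens occurs contiguously in doc_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def split_present_absent(
    doc_tokens: List[str], keyphrases: List[str]
) -> Tuple[List[str], List[str]]:
    """Partition keyphrases into those found in the document and the rest."""
    doc = [token.lower() for token in doc_tokens]
    present: List[str] = []
    absent: List[str] = []
    for phrase in keyphrases:
        # Assumed matching rule: lowercase, whitespace-tokenized, exact tokens.
        if contains_phrase(doc, phrase.lower().split()):
            present.append(phrase)
        else:
            absent.append(phrase)
    return present, absent


# Hypothetical document and keyphrases, not a record from midas/kdd.
doc_tokens = ["We", "study", "graph", "mining", "on", "social", "networks"]
keyphrases = ["graph mining", "community detection", "social network"]
present, absent = split_present_absent(doc_tokens, keyphrases)
print(present)  # ['graph mining']
print(absent)   # ['community detection', 'social network']
```

Note that "social network" counts as absent here because exact token matching does not equate it with "networks"; a stemming-based rule would classify it differently.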
Citation Information

    @inproceedings{caragea-etal-2014-citation,
        title = "Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach",
        author = "Caragea, Cornelia and Bulgarov, Florin Adrian and Godea, Andreea and Das Gollapalli, Sujatha",
        booktitle = "Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ({EMNLP})",
        month = oct,
        year = "2014",
        address = "Doha, Qatar",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/D14-1150",
        doi = "10.3115/v1/D14-1150",
        pages = "1435--1446",
    }



