midas/kp20k

Name: midas/kp20k
Creator: midas
Published: 2023-09-25 05:14:59
License: No description available

Source: Hugging Face. Updated 2023-09-25. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/midas/kp20k

Overview:

This dataset is designed to evaluate techniques for keyphrase extraction and generation from English scientific paper abstracts. Each record includes a unique document identifier, the document content, per-word BIO tags, the extractive keyphrases (present in the document), and the abstractive keyphrases (absent from the document). The dataset is split into training, test, and validation sets containing 530,809, 20,000, and 20,000 data points respectively.

Provider:

midas

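Summary of original information

Dataset overview

Purpose

Evaluating and comparing keyphrase extraction and generation techniques on English scientific paper abstracts.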

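Dataset structure

id: unique identifier of the document.
document: list of the whitespace-separated words in the document.
doc_bio_tags: BIO tag for each word in the document, where B marks the beginning of a keyphrase, I marks a word inside a keyphrase, and O marks a word outside any keyphrase.
extractive_keyphrases: list of the keyphrases that are present in the document.
abstractive_keyphrase: list of the keyphrases that are absent from the document.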
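The BIO scheme described for doc_bio_tags can be illustrated with a minimal sketch that groups tagged words back into keyphrases. The function name `keyphrases_from_bio` and the sample words and tags are hypothetical, made up for illustration; the sketch assumes the tags are the plain strings "B", "I", and "O", one per word.

```python
def keyphrases_from_bio(tokens, tags):
    """Group tokens into keyphrases using per-token B/I/O tags.

    Hypothetical helper: assumes tags are the strings "B", "I", "O".
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    phrases = []
    current = []  # tokens of the keyphrase currently being built
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new keyphrase starts; close the previous one if any.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the keyphrase that is open.
            current.append(token)
        else:
            # "O" (or a stray "I" with no open keyphrase) ends any open phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Made-up sample, not taken from the dataset.
tokens = ["we", "study", "keyphrase", "extraction", "with", "neural", "networks"]
tags = ["O", "O", "B", "I", "O", "B", "I"]
print(keyphrases_from_bio(tokens, tags))
# prints ['keyphrase extraction', 'neural networks']
```

Phrases recovered this way correspond to keyphrases that occur in the text, so they would be expected to match the extractive list rather than the abstractive one.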

Dataset statistics

Split	Data points
Train	530,809
Test	20,000
Validation	20,000

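Dataset usage

Full dataset: use load_dataset("midas/kp20k", "raw") to load the entire dataset.
Keyphrase extraction: use load_dataset("midas/kp20k", "extraction") to load only the data needed for keyphrase extraction.
Keyphrase generation: use load_dataset("midas/kp20k", "generation") to load only the data needed for keyphrase generation.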
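The three load_dataset calls listed here can be wrapped in a small helper, sketched below under some assumptions. The names `CONFIGS` and `load_kp20k` are hypothetical. The sketch assumes the Hugging Face `datasets` library is installed and the Hub is reachable; depending on the installed `datasets` version, loading may additionally require `trust_remote_code=True` or may not be supported at all, which is not verified here.

```python
# Configuration names as listed in the usage notes.
CONFIGS = ("raw", "extraction", "generation")


def load_kp20k(config="raw", split=None):
    """Load midas/kp20k with one of the listed configurations.

    Hypothetical wrapper; requires the `datasets` library and network access.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset("midas/kp20k", config, split=split)


def preview(config="extraction", split="test"):
    """Print a summary of one split (the split name is an assumption)."""
    ds = load_kp20k(config, split=split)
    print(ds)
```

Calling `preview()` downloads data, so it is left for the reader to run. Validating the configuration name before the import keeps a typo from triggering a network request.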

Citation

@InProceedings{meng-EtAl:2017:Long,
  author    = {Meng, Rui and Zhao, Sanqiang and Han, Shuguang and He, Daqing and Brusilovsky, Peter and Chi, Yu},
  title     = {Deep Keyphrase Generation},
  booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = {July},
  year      = {2017},
  address   = {Vancouver, Canada},
  publisher = {Association for Computational Linguistics},
  pages     = {582--592},
  url       = {http://aclweb.org/anthology/P17-1054}
}

@article{mahata2022ldkp,
  title   = {LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents},
  author  = {Mahata, Debanjan and Agarwal, Navneet and Gautam, Dibya and Kumar, Amardeep and Parekh, Swapnil and Singla, Yaman Kumar and Acharya, Anish and Shah, Rajiv Ratn},
  journal = {arXiv preprint arXiv:2203.15349},
  year    = {2022}
}

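Collected summary

Dataset introduction

Background and challenges

Background overview

midas/kp20k is a dataset for evaluating keyphrase extraction and generation techniques on English scientific paper abstracts. It provides 530,809 training, 20,000 test, and 20,000 validation samples, and each sample includes the document content, its BIO tags, and two kinds of keyphrases (extractive and abstractive).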

The content above was collected and summarized by 遇见数据集.
