
midas/ldkp10k

Source: Hugging Face. Updated: 2022-04-02. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/midas/ldkp10k
Resource description:
A dataset for benchmarking keyphrase extraction and generation techniques on long English scientific papers. For more details about the dataset, please refer to the original paper - [](). Data source - []()

## Dataset Summary

## Dataset Structure

### Data Fields

- **id**: unique identifier of the document.
- **sections**: list of all the sections present in the document.
- **sec_text**: list of the whitespace-separated words present in each section (one word list per section).
- **sec_bio_tags**: list of the BIO tags for the whitespace-separated words present in each section (one tag list per section).
- **extractive_keyphrases**: list of all the present keyphrases.
- **abstractive_keyphrases**: list of all the absent keyphrases.

### Data Splits

| Split | #datapoints |
|--|--|
| Train-Small | 20,000 |
| Train-Medium | 50,000 |
| Train-Large | 1,296,613 |
| Test | 10,000 |
| Validation | 10,000 |

## Usage

### Small Dataset

```python
from datasets import load_dataset

# get small dataset
dataset = load_dataset("midas/ldkp10k", "small")


def order_sections(sample):
    """
    corrects the order in which different sections appear in the document.
    resulting order is: title, abstract, other sections in the body
    """
    sections = []
    sec_text = []
    sec_bio_tags = []

    # pop() removes the entry from the sample, so the title and abstract are
    # not repeated when the remaining sections are appended below
    if "title" in sample["sections"]:
        title_idx = sample["sections"].index("title")
        sections.append(sample["sections"].pop(title_idx))
        sec_text.append(sample["sec_text"].pop(title_idx))
        sec_bio_tags.append(sample["sec_bio_tags"].pop(title_idx))

    if "abstract" in sample["sections"]:
        abstract_idx = sample["sections"].index("abstract")
        sections.append(sample["sections"].pop(abstract_idx))
        sec_text.append(sample["sec_text"].pop(abstract_idx))
        sec_bio_tags.append(sample["sec_bio_tags"].pop(abstract_idx))

    sections += sample["sections"]
    sec_text += sample["sec_text"]
    sec_bio_tags += sample["sec_bio_tags"]

    return sections, sec_text, sec_bio_tags


# print the first sample from the train, validation and test splits
for split in ("train", "validation", "test"):
    print(f"Sample from {split} data split")
    sample = dataset[split][0]
    sections, sec_text, sec_bio_tags = order_sections(sample)
    print("Fields in the sample: ", list(sample.keys()))
    print("Section names: ", sections)
    print("Tokenized Document: ", sec_text)
    print("Document BIO Tags: ", sec_bio_tags)
    print("Extractive/present Keyphrases: ", sample["extractive_keyphrases"])
    print("Abstractive/absent Keyphrases: ", sample["abstractive_keyphrases"])
    print("\n-----------\n")
```

**Output**

```bash
```

### Medium Dataset

```python
from datasets import load_dataset

# get medium dataset
dataset = load_dataset("midas/ldkp10k", "medium")
```

### Large Dataset

```python
from datasets import load_dataset

# get large dataset
dataset = load_dataset("midas/ldkp10k", "large")
```

## Citation Information

Please cite the works below if you use this dataset in your work.

```
@article{mahata2022ldkp,
  title={LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents},
  author={Mahata, Debanjan and Agarwal, Naveen and Gautam, Dibya and Kumar, Amardeep and Parekh, Swapnil and Singla, Yaman Kumar and Acharya, Anish and Shah, Rajiv Ratn},
  journal={arXiv preprint arXiv:2203.15349},
  year={2022}
}
```

```
@article{lo2019s2orc,
  title={S2ORC: The semantic scholar open research corpus},
  author={Lo, Kyle and Wang, Lucy Lu and Neumann, Mark and Kinney, Rodney and Weld, Dan S},
  journal={arXiv preprint arXiv:1911.02782},
  year={2019}
}
```

```
@inproceedings{ccano2019keyphrase,
  title={Keyphrase generation: A multi-aspect survey},
  author={{\c{C}}ano, Erion and Bojar, Ond{\v{r}}ej},
  booktitle={2019 25th Conference of Open Innovations Association (FRUCT)},
  pages={85--94},
  year={2019},
  organization={IEEE}
}
```

```
@article{meng2017deep,
  title={Deep keyphrase generation},
  author={Meng, Rui and Zhao, Sanqiang and Han, Shuguang and He, Daqing and Brusilovsky, Peter and Chi, Yu},
  journal={arXiv preprint arXiv:1704.06879},
  year={2017}
}
```

## Contributions

Thanks to [@debanjanbhucs](https://github.com/debanjanbhucs), [@dibyaaaaax](https://github.com/dibyaaaaax), [@UmaGunturi](https://github.com/UmaGunturi) and [@ad6398](https://github.com/ad6398) for adding this dataset.
Provider:
midas
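Summary of Original Information

Dataset Overview

Purpose

Benchmarking keyphrase extraction and generation techniques on long English scientific papers.

Dataset Structure

Data Fields
  • id: unique identifier of the document.
  • sections: list of all the sections in the document.
  • sec_text: for each section, the list of its whitespace-separated words.
  • sec_bio_tags: for each section, the list of BIO tags for its whitespace-separated words.
  • extractive_keyphrases: list of all present keyphrases (those that appear in the document text).
  • abstractive_keyphrases: list of all absent keyphrases (those that do not appear in the document text).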
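
To make the relationship between sec_text and sec_bio_tags concrete, here is a minimal sketch that turns one section's BIO tags back into phrases. It assumes the tags are the plain strings "B", "I" and "O" and that each tag list is aligned one-to-one with its word list; the helper name `spans_from_bio` and the sample section are hypothetical and not part of the dataset or its tooling.

```python
from typing import List


def spans_from_bio(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect the token spans marked by B/I tags as space-joined phrases."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    phrases: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new keyphrase starts; close any phrase that is still open.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the phrase opened by a preceding "B".
            current.append(token)
        else:
            # "O" (or a stray "I" with no open phrase) ends the current phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hypothetical section, made up for illustration (not taken from the dataset).
tokens = ["we", "study", "keyphrase", "extraction", "from", "long", "documents"]
tags = ["O", "O", "B", "I", "O", "B", "I"]
phrases = spans_from_bio(tokens, tags)
print(phrases)
```

Applied to every section of a real sample, the decoded spans would be expected to correspond to entries of extractive_keyphrases, although that should be checked against the data (for example, for casing or stemming differences) rather than assumed.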
Data Splits
Split          Number of datapoints
Train-Small    20,000
Train-Medium   50,000
Train-Large    1,296,613
Test           10,000
Validation     10,000

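Usage

  • Small Dataset: the example code shows how to load and process the small dataset.
  • Medium Dataset: the example code shows how to load the medium dataset.
  • Large Dataset: the example code shows how to load the large dataset.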
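
The order_sections helper in the usage example reorders a sample by calling pop() on its lists, which modifies the sample in place. The sketch below is one possible non-mutating alternative that also flattens the reordered sections into a single token sequence with aligned tags, which is the shape many sequence-tagging pipelines expect. The function names `ordered_sections` and `flatten`, and the sample dictionary, are hypothetical and only mimic the documented fields.

```python
from typing import Dict, List, Tuple


def ordered_sections(
    sample: Dict[str, list],
) -> Tuple[List[str], List[List[str]], List[List[str]]]:
    """Return sections, sec_text and sec_bio_tags with title and abstract first.

    The input sample is left unchanged.
    """
    names = sample["sections"]
    # Indices of "title" and "abstract" (when present), followed by all other
    # sections in their original order.
    front = [names.index(n) for n in ("title", "abstract") if n in names]
    rest = [i for i in range(len(names)) if i not in front]
    order = front + rest
    return (
        [names[i] for i in order],
        [sample["sec_text"][i] for i in order],
        [sample["sec_bio_tags"][i] for i in order],
    )


def flatten(nested: List[List[str]]) -> List[str]:
    """Concatenate per-section lists into one flat list."""
    return [item for section in nested for item in section]


# Hypothetical sample, made up for illustration (not taken from the dataset).
sample = {
    "sections": ["introduction", "abstract", "title"],
    "sec_text": [
        ["keyphrases", "matter"],
        ["we", "build", "a", "dataset"],
        ["a", "keyphrase", "dataset"],
    ],
    "sec_bio_tags": [
        ["B", "O"],
        ["O", "O", "O", "O"],
        ["O", "B", "O"],
    ],
}

sections, sec_text, sec_bio_tags = ordered_sections(sample)
doc_tokens = flatten(sec_text)
doc_tags = flatten(sec_bio_tags)
print(sections)
print(list(zip(doc_tokens, doc_tags)))
```

Building an index order once and applying it to all three lists keeps the word lists and tag lists aligned by construction, which is the property the BIO tags depend on.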

Citation Information

  • Mahata, Debanjan et al. "LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents." arXiv preprint arXiv:2203.15349 (2022).
  • Lo, Kyle et al. "S2ORC: The semantic scholar open research corpus." arXiv preprint arXiv:1911.02782 (2019).
  • Çano, Erion and Bojar, Ondřej. "Keyphrase generation: A multi-aspect survey." 2019 25th Conference of Open Innovations Association (FRUCT) (2019).
  • Meng, Rui et al. "Deep keyphrase generation." arXiv preprint arXiv:1704.06879 (2017).