Keyword-Extraction-Datasets
A collection of keyword extraction datasets
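This repository contains seven annotated datasets for the automatic keyword extraction task. Each dataset pairs its documents (.txt or .abstr files) with the corresponding gold-standard keyword lists (.key or .uncontr files). We use these datasets in our research on supervised and unsupervised keyword extraction.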
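The pairing of document files and gold-standard files described above could be loaded along the following lines. This is only a minimal sketch under stated assumptions, not the repository's own tooling: it assumes that a document and its keyword file sit in the same directory and share a base name, and that keyphrases are separated either by semicolons or by newlines. The names `load_pairs` and `split_keyphrases` are hypothetical helpers introduced for illustration.

```python
from pathlib import Path

# Extensions named in this README.
DOC_EXTS = (".txt", ".abstr")
KEY_EXTS = (".key", ".uncontr")


def split_keyphrases(raw):
    """Split a gold-standard file into keyphrases.

    Assumption: phrases are separated by ';' when one is present,
    otherwise by newlines. Internal whitespace is collapsed so that
    phrases wrapped across lines are rejoined.
    """
    parts = raw.split(";") if ";" in raw else raw.splitlines()
    return [" ".join(p.split()) for p in parts if p.strip()]


def load_pairs(root):
    """Return {base_name: (document_text, [keyphrases])} for every
    document under `root` that has a matching keyword file.

    Assumption: the keyword file shares the document's directory and stem.
    """
    pairs = {}
    for doc_path in sorted(Path(root).rglob("*")):
        if doc_path.suffix not in DOC_EXTS:
            continue
        for ext in KEY_EXTS:
            key_path = doc_path.with_suffix(ext)
            if key_path.exists():
                text = doc_path.read_text(encoding="utf-8", errors="replace")
                raw_keys = key_path.read_text(encoding="utf-8", errors="replace")
                pairs[doc_path.stem] = (text, split_keyphrases(raw_keys))
                break  # use the first keyword file found
    return pairs
```

Documents without a matching keyword file are skipped silently. If a dataset keeps documents and keys in separate directories, the lookup of `key_path` would need to be adapted.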
Dataset details and collection statistics
| Dataset | \|D\| | L<sub>avg</sub> | N<sub>avg</sub> | K<sub>avg</sub> | KP<sub>avg</sub> | Description |
| :--- | :---: | :---: | :---: | :---: | :---: | :--- |
| Hulth2003 | 1500 | 129 | 23 | 10 | 90.07 | Abstracts from the Inspec database |
| WWW | 1248 | 174 | 9 | 5 | 64.97 | Abstracts of CS articles published at the WWW conference |
| KDD | 704 | 204 | 8 | 4 | 68.12 | Abstracts of CS articles published at the KDD conference |
| Marujo2012 | 450 | 427 | 69 | 48 | 99.31 | Online news articles |
| Krapivin2009 | 2304 | 7961 | 11 | 5 | 96.91 | Full scientific articles from ACM |
| SemEval2010 | 244 | 8085 | 34 | 16 | 95.89 | Full scientific articles from ACM, created for SemEval-2010 Task 5 |
| NLM500 | 500 | 4854 | 27 | 14 | 71.35 | Full papers from the PubMed database |
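- |D|: number of documents
- L<sub>avg</sub>: average document length (in words)
- N<sub>avg</sub>: average number of gold-standard keywords (unigrams) assigned per document
- K<sub>avg</sub>: average number of gold-standard keyphrases (n-grams) assigned per document
- KP<sub>avg</sub>: average percentage of gold-standard keyphrases that appear in the text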
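The statistics defined above can be approximated with a short script. The following is a sketch under assumptions, since the README does not state the exact tokenization or matching rules used to produce the table: it tokenizes on word characters, counts distinct unigrams across a document's keyphrases for N, treats a keyphrase as present when its lowercased token sequence occurs in the text, and takes KP<sub>avg</sub> as the mean of per-document percentages. The function names `doc_stats` and `corpus_averages` are hypothetical.

```python
import re


def _tokens(s):
    # Assumed tokenization: lowercase runs of word characters.
    return re.findall(r"\w+", s.lower())


def doc_stats(text, keyphrases):
    """Per-document counts behind L, N, K and KP (under the assumptions above)."""
    words = _tokens(text)
    # Pad with spaces so that phrases match only on whole-token boundaries.
    padded_text = " " + " ".join(words) + " "
    phrases = [" ".join(_tokens(k)) for k in keyphrases]
    phrases = [p for p in phrases if p]  # drop entries with no word characters
    unigrams = {w for p in phrases for w in p.split()}
    present = sum(1 for p in phrases if f" {p} " in padded_text)
    return {
        "length": len(words),            # contributes to L_avg
        "n_keywords": len(unigrams),     # contributes to N_avg
        "n_keyphrases": len(phrases),    # contributes to K_avg
        "pct_present": 100.0 * present / len(phrases) if phrases else 0.0,
    }


def corpus_averages(docs):
    """Average the per-document statistics over (text, keyphrases) pairs.

    `docs` must be a non-empty sequence.
    """
    stats = [doc_stats(text, keys) for text, keys in docs]
    return {name: sum(s[name] for s in stats) / len(stats) for name in stats[0]}
```

Because the matching rules here are assumed, values computed this way may differ from the figures in the table, for example if the original counts relied on stemming or a different tokenizer.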
Citations
Hulth2003
```tex
@inproceedings{hulth2003improved,
  title = "Improved Automatic Keyword Extraction given more Linguistic Knowledge",
  author = "Hulth, Anette",
  booktitle = "Proceedings of the 2003 Conference on EMNLP",
  pages = "216--223",
  year = "2003",
  organization = "ACL"
}
```
Krapivin2009
```tex
@article{krapivin2009large,
  title = "Large Dataset for Keyphrases Extraction",
  author = "Krapivin, Mikalai and Autaeu, Aliaksandr and Marchese, Maurizio",
  journal = "Technical Report DISI-09-055",
  year = "2009",
  publisher = "University of Trento"
}
```
NLM500
```tex
@inproceedings{aronson2000nlm,
  title = "The NLM Indexing Initiative",
  author = "Aronson and others",
  booktitle = "Proceedings of the AMIA Symposium",
  pages = "17",
  year = "2000",
  organization = "American Medical Informatics Association"
}
```
SemEval2010
```tex
@inproceedings{kim2010semeval,
  title = "Semeval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles",
  author = "Kim, Su Nam and Medelyan, Olena and Kan, Min-Yen and Baldwin, Timothy",
  booktitle = "Proceedings of the 5th International Workshop on Semantic Evaluation",
  pages = "21--26",
  year = "2010",
  organization = "Association for Computational Linguistics"
}
```
Marujo2012
```tex
@inproceedings{marujo2012supervised,
  title = "Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization",
  author = "Marujo, Lu{\'\i}s and Gershman, Anatole and Carbonell, Jaime and Frederking, Robert and Neto, Jo{\~a}o P",
  booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012)",
  year = "2012"
}
```
WWW and KDD
```tex
@inproceedings{gollapalli2014extracting,
  title = "Extracting keyphrases from research papers using citation networks",
  author = "Gollapalli, Sujatha Das and Caragea, Cornelia",
  booktitle = "Twenty-Eighth AAAI Conference on Artificial Intelligence",
  year = "2014"
}
```




