ClovenDoug/150k_keyphrases_labelled
Source: Hugging Face. Updated 2024-11-03; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/ClovenDoug/150k_keyphrases_labelled
Description:
This dataset is a list of important keyphrases for academic topics. The keyphrases file contains around 150,000 keyphrases, tagged with about 300-400 labels such as algorithm, disease, theorem, lemma, chemical compound, research method, field, subfield, and topic. There are also about 2 million additional keyphrases, sorted into unigrams, bigrams, trigrams, and fourgrams. The keyphrases were obtained by a mixture of methods: web-scraping academic sources such as PubMed and Wikipedia, LLM labelling, named entity recognition over abstracts, and more. Future work includes removing around 100 unwanted labels that were artificially created by an LLM, and adding a list of useful sources to web-scrape from.
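The split into unigrams, bigrams, trigrams, and fourgrams can be illustrated with a minimal sketch. It assumes that a keyphrase's n-gram size is its whitespace-separated word count; the description does not say how the dataset actually tokenizes phrases. The function name, bucket names, and sample phrases are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical bucket names; the dataset's real file or column names are not stated.
NGRAM_NAMES = {1: "unigrams", 2: "bigrams", 3: "trigrams", 4: "fourgrams"}


def bucket_by_ngram(keyphrases):
    """Group keyphrases by word count (1 to 4); empty or longer phrases are skipped."""
    buckets = defaultdict(list)
    for phrase in keyphrases:
        n = len(phrase.split())  # whitespace tokenization is an assumption
        name = NGRAM_NAMES.get(n)
        if name is not None:
            buckets[name].append(phrase)
    return dict(buckets)


# Illustrative inputs only, not drawn from the dataset.
sample = [
    "entropy",
    "gradient descent",
    "support vector machine",
    "hidden Markov model training",
]
result = bucket_by_ngram(sample)
```

Here `result` maps each bucket name to the phrases of that length, so a lookup such as `result["bigrams"]` returns the two-word phrases in input order.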
Provider:
ClovenDoug



