
ClovenDoug/keyphrases_updated

Source: Hugging Face. Updated 2024-11-29. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/ClovenDoug/keyphrases_updated
Overview:
This is a larger dataset of keyphrases that can be used for academic search. The keyphrases were largely generated from over 1 billion n-grams by filtering out specific stop words, and each one has a count of at least 5 in the original dataset. The keyphrases kept here also have a high TF-IDF (term frequency-inverse document frequency) score, where the "document" in the calculation is an academic subfield: the fewer subfields a phrase occurs in, the higher its score. The result is a fairly large dataset of roughly the 3-4 million most important keyphrases you'd encounter when searching for academic articles. It includes names of algorithms, niche subfields, scientific concepts, diseases, and so on. The dataset contains 103,314 unigrams, 378,041 bigrams, 2,444,002 trigrams, and 711,771 fourgrams.
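The filtering and scoring described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's actual pipeline: the function name `score_keyphrases`, the `STOP_WORDS` set, the input layout (`{subfield: {ngram: count}}`), the rule of dropping any n-gram that contains a stop word, and the exact formula (sublinear term frequency times `ln(N / df) + 1`) are all hypothetical choices. Only the count threshold of 5 and the idea of treating subfields as documents come from the description.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stop-word list; the list actually used for the dataset is not given.
STOP_WORDS = {"the", "of", "and", "in", "a", "to"}
MIN_COUNT = 5  # from the description: each keyphrase has count >= 5


def score_keyphrases(ngram_counts, stop_words=STOP_WORDS, min_count=MIN_COUNT):
    """Score n-grams with a TF-IDF variant where each subfield is one "document".

    ngram_counts: {subfield_name: {ngram: count}}
    Returns {ngram: score} for n-grams that pass the filters.
    """
    n_subfields = len(ngram_counts)
    total = Counter()                    # corpus-wide count per n-gram
    subfields_seen = defaultdict(set)    # subfields in which each n-gram occurs

    for subfield, counts in ngram_counts.items():
        for ngram, count in counts.items():
            total[ngram] += count
            subfields_seen[ngram].add(subfield)

    scores = {}
    for ngram, count in total.items():
        if count < min_count:
            continue  # too rare in the original data
        if any(token in stop_words for token in ngram.split()):
            continue  # assumption: drop n-grams containing any stop word
        tf = 1.0 + math.log(count)  # assumed sublinear term frequency
        # Fewer subfields -> larger idf -> higher score.
        idf = math.log(n_subfields / len(subfields_seen[ngram])) + 1.0
        scores[ngram] = tf * idf
    return scores


if __name__ == "__main__":
    toy = {
        "machine learning": {"gradient descent": 6, "data": 10},
        "biology": {"crispr": 7, "data": 10},
    }
    ranked = sorted(score_keyphrases(toy).items(), key=lambda kv: -kv[1])
    for phrase, score in ranked:
        print(f"{phrase}\t{score:.3f}")
```

With this kind of scoring, a phrase confined to one subfield outranks a generic term that appears everywhere, even when the generic term is more frequent overall.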
Provider:
ClovenDoug