Sprakbanken/ngram
Source: Hugging Face. Updated: 2026-01-19. Indexed: 2026-02-07.
Download link:
https://hf-mirror.com/datasets/Sprakbanken/ngram
Description:
This dataset contains frequency counts for n-grams (unigrams, bigrams, and trigrams) from all books and newspapers digitized by the National Library of Norway before 2022-07-15. The n-grams are based on approximately 610,000 books and 4,000,000 newspapers, totaling about 138.5 billion tokens (words and punctuation marks). The dataset is divided into two main parts, digavis (newspapers) and digibok (books). Each part contains n-gram counts for different years, with fields such as the first word, second word, third word, language, year, and count.
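As a rough illustration of how rows with these fields might be aggregated, the sketch below sums counts per year for one n-gram. The key names (first, second, third, lang, year, count), the language codes, and the convention that shorter n-grams leave the unused word fields empty are assumptions made for this example, not confirmed details of the dataset. The sample rows are invented toy values, not real counts.

```python
from collections import defaultdict

# Toy rows invented for illustration only; these are NOT real counts.
# Key names and the use of None for unused word slots are assumptions.
ROWS = [
    {"first": "i", "second": "dag", "third": None, "lang": "nob", "year": 1950, "count": 12},
    {"first": "i", "second": "dag", "third": None, "lang": "nno", "year": 1950, "count": 3},
    {"first": "i", "second": "dag", "third": None, "lang": "nob", "year": 1951, "count": 7},
    {"first": "i", "second": "dag", "third": "og", "lang": "nob", "year": 1950, "count": 2},
    {"first": "dag", "second": None, "third": None, "lang": "nob", "year": 1950, "count": 40},
]


def yearly_counts(rows, ngram, lang=None):
    """Sum counts per year for rows whose words match `ngram` exactly.

    `ngram` is a tuple of one to three words. If `lang` is given,
    only rows with that language value are included.
    """
    keys = ("first", "second", "third")
    # Pad the query so that a bigram only matches rows with an empty third word.
    target = tuple(ngram) + (None,) * (3 - len(ngram))
    totals = defaultdict(int)
    for row in rows:
        if lang is not None and row["lang"] != lang:
            continue
        if tuple(row.get(k) for k in keys) == target:
            totals[row["year"]] += row["count"]
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    print(yearly_counts(ROWS, ("i", "dag")))
    print(yearly_counts(ROWS, ("i", "dag"), lang="nob"))
```

The same function could be applied to rows read from either part (digavis or digibok), provided the real column names are substituted for the assumed ones.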
Provider:
Sprakbanken



