
Sprakbanken/ngram

Source: Hugging Face. Updated 2026-01-19. Indexed 2026-02-07.
Download link:
https://hf-mirror.com/datasets/Sprakbanken/ngram
Description:
This dataset contains n-gram frequencies (unigrams, bigrams, and trigrams) from all books and newspapers digitized by the National Library of Norway before 2022-07-15. The n-grams are based on approximately 610,000 books and 4,000,000 newspapers, about 138.5 billion tokens in total (i.e. words and punctuation). The dataset is divided into two main parts: digavis (newspapers) and digibok (books). Each part contains n-gram counts for different years, with fields such as first word, second word, third word, language, year, and count.
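As a rough illustration of how rows with these fields might be used, the sketch below sums trigram counts per year, optionally restricted to one language. The column names (`first`, `second`, `third`, `lang`, `year`, `count`), the language codes, and the toy rows are assumptions made for this example; they are not taken from the dataset, and the counts are invented.

```python
from collections import defaultdict

# Toy rows with hypothetical column names; the counts are made up for illustration
# and are not real values from the dataset.
ROWS = [
    {"first": "det", "second": "er", "third": "ikke", "lang": "nob", "year": 1950, "count": 12},
    {"first": "det", "second": "er", "third": "ikke", "lang": "nob", "year": 1951, "count": 15},
    {"first": "det", "second": "er", "third": "ikke", "lang": "nno", "year": 1950, "count": 3},
    {"first": "det", "second": "er", "third": "en", "lang": "nob", "year": 1950, "count": 7},
]


def counts_by_year(rows, trigram, lang=None):
    """Sum the counts of one trigram per year, optionally for a single language."""
    target = tuple(trigram)
    totals = defaultdict(int)
    for row in rows:
        # Skip rows for other trigrams.
        if (row["first"], row["second"], row["third"]) != target:
            continue
        # Skip rows for other languages when a language filter is given.
        if lang is not None and row["lang"] != lang:
            continue
        totals[row["year"]] += row["count"]
    # Return a plain dict ordered by year.
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    print(counts_by_year(ROWS, ("det", "er", "ikke")))
    print(counts_by_year(ROWS, ("det", "er", "ikke"), lang="nob"))
```

The same aggregation would apply to unigram or bigram rows by comparing fewer word fields; the actual column names should be checked against the dataset files before use.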
Provider:
Sprakbanken