
its5Q/wikireading

Source: Hugging Face. Updated: 2024-08-29. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/its5Q/wikireading
Description:

Wikireading is a dataset of non-fiction educational book chapters scraped from the Russian website Wikireading. The books cover domains such as Biology, Art, History, Religion, and more. Their high educational value and breadth of subject matter make the dataset a good candidate for pretraining. The dataset contains approximately 26 million rows, totaling around 7 billion tokens (approximately 28 billion characters) of mostly Russian text, with some books written in other Slavic languages. Each row represents a single chapter of a book and contains the book title, the author, the HTML of the book returned by Wikireading, and the book's text extracted using Trafilatura. An additional column, `litres_preview`, indicates whether the book is a preview provided by Litres, in which case it may contain incomplete chapters.
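Because rows flagged by `litres_preview` may contain incomplete chapters, a pretraining pipeline might want to skip them. The following is a minimal sketch of such a filter, not the dataset author's method. It assumes `litres_preview` is a boolean-like value and that the extracted text is stored under a column named `text`. Only `litres_preview` is named in the description; the other column names and the sample rows are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator

# Assumption: only `litres_preview` is named in the dataset description.
# The text column name below is hypothetical; check the dataset card for the real schema.
TEXT_KEY = "text"


def iter_full_chapters(rows: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield rows that are not Litres previews and have non-empty extracted text."""
    for row in rows:
        if row.get("litres_preview"):
            continue  # previews may contain incomplete chapters
        if not (row.get(TEXT_KEY) or "").strip():
            continue  # skip chapters where no text was extracted
        yield row


# Hypothetical sample rows standing in for real dataset rows.
sample_rows = [
    {"title": "Book A", "author": "Author A", "text": "Chapter one.", "litres_preview": False},
    {"title": "Book B", "author": "Author B", "text": "Excerpt only.", "litres_preview": True},
    {"title": "Book C", "author": "Author C", "text": "   ", "litres_preview": False},
]

kept = list(iter_full_chapters(sample_rows))
print([row["title"] for row in kept])
```

The function accepts any iterable of dict-like rows, so it could also be applied to rows streamed with the Hugging Face `datasets` library, for example `load_dataset("its5Q/wikireading", split="train", streaming=True)`, assuming the dataset exposes a `train` split. Streaming avoids downloading all of the roughly 26 million rows before filtering.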
Provider:
its5Q