
soundstarrain/ru-lightnovels-clean

Source: Hugging Face. Updated 2026-03-21. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/soundstarrain/ru-lightnovels-clean
Resource description:
---
language:
- ru
pretty_name: ru-lightnovels-clean
license: other
license_name: custom-restricted-fan-translation
tags:
- russian
- light-novel
- fan-translation
- curated
- restricted
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: train-*.parquet
---

ru-lightnovels-clean is a cleaned Russian light novel dataset built from Baka-Tsuki Russian project pages and recoverable linked Russian fan-translation sources.

The dataset contains 27 series, 699 chapters, and 568,933 line-level records organized from a series / volume / chapter corpus, with 31,074,226 characters, 4,804,449 words, and 11,226,584 tokens measured with the Qwen/Qwen3-8B tokenizer.

The default Hugging Face dataset view uses one row per text line. The original hierarchical text layout under `novels/` is preserved alongside the standard `train-*.parquet` export.

The source texts were manually reviewed and additionally strict-cleaned to remove obvious wrong-language pages, placeholder pages, glossary pages, Wikipedia/archive/project pages, translator-note pages, duplicated entries, and other non-story material, while keeping in-book forewords, afterwords, author notes, and commentary when they belonged to the original volume.

Copyright and translation rights remain with the original rightsholders, publishers, and/or translators. This dataset should be treated as a restricted research dataset and should not be assumed to be freely redistributable or commercially reusable.
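Because the default view stores one row per text line, most uses will first need to regroup lines into chapters. The sketch below shows one way that could look. The card does not list the column names, so `series`, `volume`, `chapter`, `line_no`, and `text` are hypothetical placeholders, and the sample records are invented for illustration rather than taken from the dataset.

```python
from collections import defaultdict


def group_lines_into_chapters(records):
    """Rebuild chapter texts from line-level records.

    Assumption: each record is a dict with the hypothetical keys
    'series', 'volume', 'chapter', 'line_no', and 'text'. The real
    column names in the parquet export may differ.
    """
    buckets = defaultdict(list)
    for rec in records:
        key = (rec["series"], rec["volume"], rec["chapter"])
        buckets[key].append((rec["line_no"], rec["text"]))

    chapters = {}
    for key, lines in buckets.items():
        # Restore reading order inside each chapter before joining.
        lines.sort(key=lambda pair: pair[0])
        chapters[key] = "\n".join(text for _, text in lines)
    return chapters


# Invented sample rows, deliberately out of order.
sample = [
    {"series": "Series A", "volume": 1, "chapter": 1, "line_no": 2, "text": "Вторая строка."},
    {"series": "Series A", "volume": 1, "chapter": 1, "line_no": 1, "text": "Первая строка."},
    {"series": "Series A", "volume": 1, "chapter": 2, "line_no": 1, "text": "Новая глава."},
]

chapters = group_lines_into_chapters(sample)
for (series, volume, chapter), text in sorted(chapters.items()):
    print(series, volume, chapter, len(text))
```

With the real data, the records would presumably come from something like `datasets.load_dataset("soundstarrain/ru-lightnovels-clean", split="train")`, subject to the access and licensing restrictions described above. Check the actual schema before relying on any of the field names used here.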

Provider:
soundstarrain