
soundstarrain/de-lightnovels-clean

Source: Hugging Face. Updated 2026-03-21; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/soundstarrain/de-lightnovels-clean
Description:
---
language:
- de
pretty_name: de-lightnovels-clean
license: other
license_name: custom-restricted-fan-translation
tags:
- german
- light-novel
- fan-translation
- curated
- restricted
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: train-*.parquet
---

de-lightnovels-clean is a cleaned German light novel dataset built from Baka-Tsuki German project pages and recoverable linked German fan-translation sources.

The dataset contains 15 series, 341 chapters, and 282,247 line-level records organized from a series / volume / chapter corpus, with 18,388,040 characters, 2,865,323 words, and 5,187,303 tokens measured with the Qwen/Qwen3-8B tokenizer.

The default Hugging Face dataset view uses one row per text line. The original hierarchical text layout under `novels/` is preserved alongside the standard `train-*.parquet` export.

The source texts were manually reviewed and additionally strict-cleaned to remove obvious wrong-language pages, placeholder pages, glossary pages, Wikipedia/archive/project pages, translator-note pages, duplicated entries, and other non-story material, while keeping in-book forewords, afterwords, author notes, and commentary when they belonged to the original volume.

Copyright and translation rights remain with the original rightsholders, publishers, and/or translators. This dataset should be treated as a restricted research dataset and should not be assumed to be freely redistributable or commercially reusable.
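Because the default view stores one row per text line, chapter-level text has to be reassembled before use. The following is a minimal sketch of that regrouping, not the dataset's own tooling. The column names `series`, `volume`, `chapter`, and `text` are assumptions made for illustration (the card does not list the schema), and the rows are assumed to arrive in their original line order. A small in-memory sample stands in for rows read from `train-*.parquet`.

```python
from collections import defaultdict


def group_lines_by_chapter(rows, key_fields=("series", "volume", "chapter"),
                           text_field="text"):
    """Join line-level rows into one string per (series, volume, chapter).

    The field names are hypothetical; adjust them to the real column names.
    Rows are assumed to be in original line order already.
    """
    chapters = defaultdict(list)
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        chapters[key].append(row[text_field])
    # Join each chapter's lines with newlines, keeping first-seen chapter order.
    return {key: "\n".join(lines) for key, lines in chapters.items()}


# Hypothetical sample rows standing in for the Parquet export.
sample_rows = [
    {"series": "Series A", "volume": "Volume 1", "chapter": "Chapter 1",
     "text": "Erste Zeile."},
    {"series": "Series A", "volume": "Volume 1", "chapter": "Chapter 1",
     "text": "Zweite Zeile."},
    {"series": "Series A", "volume": "Volume 1", "chapter": "Chapter 2",
     "text": "Neue Szene."},
]

chapter_texts = group_lines_by_chapter(sample_rows)
for key, text in chapter_texts.items():
    print(key, len(text.splitlines()), "lines")
```

If the real rows carry an explicit line index, sorting by it before grouping would remove the reliance on input order.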

Provider:
soundstarrain