
soundstarrain/multilingual-small-languages-lightnovels-clean

Source: Hugging Face. Updated: 2026-03-22. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/soundstarrain/multilingual-small-languages-lightnovels-clean
Resource description:
---
language:
- bg
- eo
- fil
- hr
- mk
- my
- nl
- no
- th
- uk
pretty_name: multilingual-small-languages-lightnovels-clean
license: other
license_name: custom-restricted-fan-translation
tags:
- multilingual
- light-novel
- fan-translation
- curated
- restricted
- small-languages
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: lines-*.parquet
---

multilingual-small-languages-lightnovels-clean is a cleaned multilingual small-language light novel dataset bundled from Baka-Tsuki fan-translation project pages and recoverable linked sources.

The dataset contains 10 languages, 12 series, 58 chapters, and 15,302 line-level records, with 1,036,789 characters, 179,105 words, and 374,346 tokens measured with the Qwen/Qwen3-8B tokenizer.

The default Hugging Face dataset view uses one row per text line from the merged `lines-*.parquet` export. The original hierarchical text layout for each included language package is preserved under `languages/`.

This bundle includes: bg (Bulgarian), eo (Esperanto), fil (Filipino), hr (Croatian), mk (Macedonian), my (Myanmar), nl (Dutch), no (Norwegian), th (Thai), uk (Ukrainian).

The source texts were manually reviewed and strict-cleaned to remove obvious wrong-language pages, placeholder pages, glossary pages, registration/project pages, translator-note pages, illustration-only pages, duplicated full-text wrapper pages, and other non-story material, while keeping in-book forewords, afterwords, author notes, and commentary when they belonged to the original volume.

Copyright and translation rights remain with the original rightsholders, publishers, and/or translators. This dataset should be treated as a restricted research dataset and should not be assumed to be freely redistributable or commercially reusable.
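As a rough illustration of how the line-level `train` split described above might be loaded and summarized per language, here is a minimal Python sketch. It assumes the Hugging Face `datasets` package is installed and the repository is accessible. The column names `language` and `text` are hypothetical, since the card does not list the schema, so check the real features before using them.

```python
from collections import Counter
from typing import Iterable, Mapping

REPO_ID = "soundstarrain/multilingual-small-languages-lightnovels-clean"


def load_lines():
    """Load the default config's train split from the Hugging Face Hub.

    Needs the `datasets` package and network access. Whether the repo
    requires accepting access terms is not confirmed by the card.
    """
    # Imported lazily so the offline helper below works without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train")


def chars_per_language(rows: Iterable[Mapping[str, str]],
                       lang_key: str = "language",
                       text_key: str = "text") -> Counter:
    """Sum character counts per language code.

    `lang_key` and `text_key` are ASSUMED column names (not stated in the
    card); inspect `load_lines().features` to find the real ones.
    """
    totals = Counter()
    for row in rows:
        totals[row[lang_key]] += len(row[text_key])
    return totals


# Tiny in-memory stand-in for real rows, so the sketch runs offline.
sample = [
    {"language": "nl", "text": "Hallo wereld"},
    {"language": "uk", "text": "Привіт"},
    {"language": "nl", "text": "Dag"},
]
print(chars_per_language(sample))
```

With network access, `chars_per_language(load_lines())` would apply the same tally to the real rows, provided the assumed column names match the actual schema.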
Provider:
soundstarrain