soundstarrain/multilingual-small-languages-lightnovels-clean
Source: Hugging Face. Updated: 2026-03-22. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/soundstarrain/multilingual-small-languages-lightnovels-clean
Resource summary:
---
language:
- bg
- eo
- fil
- hr
- mk
- my
- nl
- no
- th
- uk
pretty_name: multilingual-small-languages-lightnovels-clean
license: other
license_name: custom-restricted-fan-translation
tags:
- multilingual
- light-novel
- fan-translation
- curated
- restricted
- small-languages
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: lines-*.parquet
---
multilingual-small-languages-lightnovels-clean is a cleaned multilingual small-language light novel dataset bundled from Baka-Tsuki fan-translation project pages and recoverable linked sources.
The dataset contains 10 languages, 12 series, 58 chapters, and 15,302 line-level records, with 1,036,789 characters, 179,105 words, and 374,346 tokens measured with the Qwen/Qwen3-8B tokenizer.
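As a rough sanity check on the figures above, the per-unit averages can be derived with a few lines of Python. This is only a sketch: the numbers are copied from this card, while the `STATS` dictionary and the `ratio` helper are illustrative names that are not part of the dataset.

```python
# Figures copied from the dataset card; the names below are illustrative only.
STATS = {
    "languages": 10,
    "series": 12,
    "chapters": 58,
    "lines": 15_302,
    "characters": 1_036_789,
    "words": 179_105,
    "tokens": 374_346,  # measured with the Qwen/Qwen3-8B tokenizer per the card
}


def ratio(numerator: str, denominator: str) -> float:
    """Return STATS[numerator] / STATS[denominator], rounded to two decimals."""
    return round(STATS[numerator] / STATS[denominator], 2)


summary = {
    "chars_per_line": ratio("characters", "lines"),
    "lines_per_chapter": ratio("lines", "chapters"),
    "tokens_per_word": ratio("tokens", "words"),
}
print(summary)
```

The card does not state how words were counted for scripts written without spaces, such as Thai and Myanmar, so the tokens-per-word ratio is best read as a rough indicator rather than a per-language measurement.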
The default Hugging Face dataset view uses one row per text line from the merged `lines-*.parquet` export. The original hierarchical text layout for each included language package is preserved under `languages/`.
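A minimal sketch of loading the default view with the Hugging Face `datasets` library is shown here. It assumes the repository is reachable with your credentials (the card tags it as restricted, so access may be gated), and it does not assume any column names, since the card does not list them. The `load_lines` wrapper is a hypothetical helper, not something shipped with the dataset.

```python
REPO_ID = "soundstarrain/multilingual-small-languages-lightnovels-clean"


def load_lines():
    """Load the `train` split of the default config (one row per text line).

    Hypothetical helper: requires the `datasets` package and network access
    to the Hub; the import is kept local so the module loads without it.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train")


if __name__ == "__main__":
    lines = load_lines()
    # Inspect the schema first; the column names are not documented on the card.
    print(lines.features)
    print(lines.num_rows)
```

If the card's figures are current, `num_rows` should match the 15,302 line-level records stated above.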
This bundle includes: bg (Bulgarian), eo (Esperanto), fil (Filipino), hr (Croatian), mk (Macedonian), my (Myanmar), nl (Dutch), no (Norwegian), th (Thai), uk (Ukrainian).
The source texts were manually reviewed and strictly cleaned to remove obvious wrong-language pages, placeholder pages, glossary pages, registration/project pages, translator-note pages, illustration-only pages, duplicated full-text wrapper pages, and other non-story material. In-book forewords, afterwords, author notes, and commentary were kept when they belonged to the original volume.
Copyright and translation rights remain with the original rightsholders, publishers, and/or translators. This dataset should be treated as a restricted research dataset and should not be assumed to be freely redistributable or commercially reusable.
Provider:
soundstarrain



