soundstarrain/id-lightnovels-clean
Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/soundstarrain/id-lightnovels-clean
Resource description:
---
language:
- id
pretty_name: id-lightnovels-clean
license: other
license_name: custom-restricted-fan-translation
tags:
- indonesian
- light-novel
- fan-translation
- curated
- restricted
size_categories:
- 1M<n<10M
configs:
- config_name: default
  data_files:
  - split: train
    path: lines-*.parquet
---
id-lightnovels-clean is a cleaned Indonesian light novel dataset built from Baka-Tsuki Indonesian project pages and recoverable linked Indonesian fan-translation sources.
The dataset contains 44 series, 2,340 chapters, and 1,391,604 line-level records derived from a series / volume / chapter hierarchy. In total it holds 78,512,293 characters, 10,882,925 words, and 23,881,049 tokens, with tokens measured using the Qwen/Qwen3-8B tokenizer.
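As a rough illustration of how record, character, and word totals like these could be recomputed from line-level text, here is a minimal sketch. The card does not say how words were counted, so whitespace splitting is an assumption, and `corpus_stats` and the sample lines are hypothetical. Token counts are left out because they depend on the Qwen/Qwen3-8B tokenizer.

```python
from typing import Dict, Iterable


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count records, characters, and whitespace-delimited words."""
    records = chars = words = 0
    for line in lines:
        records += 1
        chars += len(line)          # Unicode code points, not bytes
        words += len(line.split())  # assumption: words are split on whitespace
    return {"records": records, "characters": chars, "words": words}


# Hypothetical sample lines, not taken from the dataset.
sample = ["Dia menatap langit malam.", '"Ayo pergi," katanya.']
stats = corpus_stats(sample)
print(stats)
```

A different word definition (for example, one that strips punctuation) would give different totals, so figures from this sketch may not match the card exactly.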
The default Hugging Face dataset view uses one row per text line. The original hierarchical text layout under `novels/` is preserved alongside the standard `lines-*.parquet` export.
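Because the default view stores one row per text line, chapter-level text has to be reassembled by grouping rows. The sketch below shows one way this might be done. The column names `series`, `volume`, `chapter`, and `text` are assumptions for illustration, since the card does not list the actual schema, and the rows are assumed to arrive in reading order.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple

ChapterKey = Tuple[str, str, str]


def group_by_chapter(rows: Iterable[Mapping[str, str]]) -> Dict[ChapterKey, str]:
    """Join line-level rows back into one text block per chapter."""
    chapters = defaultdict(list)
    for row in rows:
        # Hypothetical column names; adjust to the real parquet schema.
        key = (row["series"], row["volume"], row["chapter"])
        chapters[key].append(row["text"])
    return {key: "\n".join(lines) for key, lines in chapters.items()}


# Hypothetical rows standing in for records read from lines-*.parquet.
rows = [
    {"series": "A", "volume": "1", "chapter": "1", "text": "Baris pertama."},
    {"series": "A", "volume": "1", "chapter": "1", "text": "Baris kedua."},
    {"series": "A", "volume": "1", "chapter": "2", "text": "Bab baru."},
]
chapters = group_by_chapter(rows)
print(chapters[("A", "1", "1")])
```

If the real export carries an explicit line index, sorting on it before grouping would be safer than relying on row order.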
The source texts were manually reviewed and then strictly cleaned to remove non-story material: obvious wrong-language pages, placeholder pages, glossary pages, Wikipedia/archive/project pages, translator-note pages, duplicated entries, and similar content. In-book forewords, afterwords, author notes, and commentary were kept when they belonged to the original volume.
Copyright and translation rights remain with the original rightsholders, publishers, and/or translators. This dataset should be treated as a restricted research dataset and should not be assumed to be freely redistributable or commercially reusable.
Provided by:
soundstarrain



