soundstarrain/lt-lightnovels-clean

Name: soundstarrain/lt-lightnovels-clean
Creator: soundstarrain
Published: 2026-03-22 16:17:09
License: No description available

Source: Hugging Face. Updated 2026-03-22. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/soundstarrain/lt-lightnovels-clean

Resource description:

---
language:
- lt
pretty_name: lt-lightnovels-clean
license: other
license_name: custom-restricted-fan-translation
tags:
- lithuanian
- light-novel
- fan-translation
- curated
- restricted
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: lines-*.parquet
---

lt-lightnovels-clean is a cleaned Lithuanian light novel dataset built from Baka-Tsuki Lithuanian project pages and recoverable linked Lithuanian fan-translation sources.

The dataset contains 3 series, 40 chapters, and 8,262 line-level records organized from a series / volume / chapter corpus, with 532,734 characters, 76,257 words, and 232,355 tokens measured with the Qwen/Qwen3-8B tokenizer.

The default Hugging Face dataset view uses one row per text line. The original hierarchical text layout under `novels/` is preserved alongside the standard `lines-*.parquet` export.

The source texts were manually reviewed and additionally strict-cleaned to remove obvious wrong-language pages, placeholder pages, glossary pages, Wikipedia/archive/project pages, translator-note pages, duplicated entries, and other non-story material. In-book forewords, afterwords, author notes, and commentary were kept when they belonged to the original volume.

Copyright and translation rights remain with the original rightsholders, publishers, and/or translators. This dataset should be treated as a restricted research dataset and should not be assumed to be freely redistributable or commercially reusable.
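Because the default view holds one row per text line, corpus-level figures such as the line, character, and word counts quoted above can be recomputed by aggregating over rows. The following is a minimal sketch of that aggregation, not the card author's method: the field names `text` and `series` are hypothetical (the card does not list the column names), and the toy records stand in for real rows, which could presumably be obtained with `datasets.load_dataset("soundstarrain/lt-lightnovels-clean", split="train")`. The card does not say how its words were counted, so whitespace splitting here may not reproduce the published word total.

```python
from collections import Counter


def corpus_stats(rows, text_key="text", series_key="series"):
    """Aggregate line-level records into simple corpus statistics.

    `text_key` and `series_key` are hypothetical column names; adjust
    them to whatever the actual parquet schema uses.
    """
    lines_per_series = Counter()
    characters = 0
    words = 0
    for row in rows:
        text = row[text_key]
        lines_per_series[row[series_key]] += 1
        characters += len(text)      # Unicode code points, not bytes
        words += len(text.split())   # whitespace-delimited words (an assumption)
    return {
        "lines": sum(lines_per_series.values()),
        "characters": characters,
        "words": words,
        "lines_per_series": dict(lines_per_series),
    }


# Toy records standing in for rows of the train split (illustrative only).
sample_rows = [
    {"series": "A", "text": "Labas rytas."},
    {"series": "A", "text": "Kaip sekasi?"},
    {"series": "B", "text": "Viskas gerai."},
]
stats = corpus_stats(sample_rows)
```

Token counts would additionally require the Qwen/Qwen3-8B tokenizer named in the card, which this sketch leaves out.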

Provider:

soundstarrain
