soundstarrain/vi-lightnovels-clean

Name: soundstarrain/vi-lightnovels-clean
Creator: soundstarrain
Published: 2026-03-22 16:14:59
License: No description available

Source: Hugging Face. Updated 2026-03-22. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/soundstarrain/vi-lightnovels-clean

Resource description:

---
language:
- vi
pretty_name: vi-lightnovels-clean
license: other
license_name: custom-restricted-fan-translation
tags:
- vietnamese
- light-novel
- fan-translation
- curated
- restricted
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: lines-*.parquet
---

vi-lightnovels-clean is a cleaned Vietnamese light novel dataset built from Baka-Tsuki Vietnamese project pages and recoverable linked Vietnamese fan-translation sources.

The dataset contains 19 series, 244 chapters, and 165,612 line-level records organized from a series / volume / chapter corpus, with 8,147,606 characters, 1,755,370 words, and 2,485,904 tokens measured with the Qwen/Qwen3-8B tokenizer.

The default Hugging Face dataset view uses one row per text line. The original hierarchical text layout under `novels/` is preserved alongside the standard `lines-*.parquet` export.

The source texts were manually reviewed and additionally strict-cleaned to remove obvious wrong-language pages, placeholder pages, glossary pages, Wikipedia/archive/project pages, translator-note pages, duplicated entries, and other non-story material, while keeping in-book forewords, afterwords, author notes, and commentary when they belonged to the original volume.

Copyright and translation rights remain with the original rightsholders, publishers, and/or translators. This dataset should be treated as a restricted research dataset and should not be assumed to be freely redistributable or commercially reusable.
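As a quick sanity check on the figures above, the following minimal sketch derives a few per-unit averages from the totals stated in the card. The names `CARD_STATS`, `derived_averages`, and `load_lines` are hypothetical helpers introduced here for illustration. `load_lines` assumes the `datasets` package is installed and that the repository is reachable from your environment; the card does not list column names, so they are not referenced.

```python
# Counts copied from the dataset card above.
CARD_STATS = {
    "series": 19,
    "chapters": 244,
    "lines": 165_612,
    "characters": 8_147_606,
    "words": 1_755_370,
    "tokens": 2_485_904,  # measured with the Qwen/Qwen3-8B tokenizer, per the card
}


def derived_averages(stats: dict) -> dict:
    """Return rough per-unit averages implied by the card's totals."""
    return {
        "chapters_per_series": stats["chapters"] / stats["series"],
        "lines_per_chapter": stats["lines"] / stats["chapters"],
        "characters_per_line": stats["characters"] / stats["lines"],
        "tokens_per_word": stats["tokens"] / stats["words"],
    }


def load_lines(repo_id: str = "soundstarrain/vi-lightnovels-clean"):
    """Load the default config's train split (one row per text line).

    Assumption: the `datasets` package is installed and the repository is
    reachable. Inspect the returned columns before relying on any of them.
    """
    from datasets import load_dataset  # lazy import; the call needs network access

    return load_dataset(repo_id, split="train")


if __name__ == "__main__":
    for name, value in derived_averages(CARD_STATS).items():
        print(f"{name}: {value:.2f}")
```

Under these totals, the corpus averages about 1.4 Qwen3 tokens per word and roughly 49 characters per line, which is consistent with short, dialogue-heavy lines. These are averages over the whole corpus and say nothing about the distribution within individual series.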

Provider:

soundstarrain
