
soundstarrain/tr-lightnovels-clean

Hugging Face | Updated 2026-03-22 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/soundstarrain/tr-lightnovels-clean
Resource description:
---
language:
- tr
pretty_name: tr-lightnovels-clean
license: other
license_name: custom-restricted-fan-translation
tags:
- turkish
- light-novel
- fan-translation
- curated
- restricted
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: lines-*.parquet
---

tr-lightnovels-clean is a cleaned Turkish light novel dataset built from Baka-Tsuki Turkish project pages and recoverable linked Turkish fan-translation sources.

The dataset contains 4 series, 58 chapters, and 31,924 line-level records organized from a series / volume / chapter corpus, with 1,370,532 characters, 187,360 words, and 481,515 tokens measured with the Qwen/Qwen3-8B tokenizer.

The default Hugging Face dataset view uses one row per text line. The original hierarchical text layout under `novels/` is preserved alongside the standard `lines-*.parquet` export.

The source texts were manually reviewed and additionally strict-cleaned to remove obvious wrong-language pages, placeholder pages, glossary pages, Wikipedia/archive/project pages, translator-note pages, duplicated entries, and other non-story material, while keeping in-book forewords, afterwords, author notes, and commentary when they belonged to the original volume.

Copyright and translation rights remain with the original rightsholders, publishers, and/or translators. This dataset should be treated as a restricted research dataset and should not be assumed to be freely redistributable or commercially reusable.
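The one-row-per-line layout described above can be explored with a few small helpers. The following is a minimal sketch, not the dataset author's tooling: the column names `series`, `volume`, `chapter`, and `text` are assumptions made for illustration and should be checked against the actual parquet schema. The word count here is a plain whitespace split, so it is not guaranteed to reproduce the figures quoted in the card.

```python
from collections import defaultdict


def load_lines(repo_id="soundstarrain/tr-lightnovels-clean"):
    """Load the default config's train split.

    Needs the `datasets` package and network access, so the import is lazy
    and the offline helpers below work without it.
    """
    from datasets import load_dataset
    return load_dataset(repo_id, split="train")


def corpus_stats(rows, text_key="text"):
    """Count records, characters, and whitespace-separated words.

    `text_key` is an assumed column name, not confirmed by the card.
    """
    n_rows = n_chars = n_words = 0
    for row in rows:
        line = row[text_key]
        n_rows += 1
        n_chars += len(line)
        n_words += len(line.split())
    return {"rows": n_rows, "characters": n_chars, "words": n_words}


def group_by_chapter(rows, keys=("series", "volume", "chapter"), text_key="text"):
    """Rebuild a series / volume / chapter view from line-level rows.

    The key column names are hypothetical; row order is preserved per chapter.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[tuple(row[k] for k in keys)].append(row[text_key])
    return dict(grouped)


# Tiny in-memory stand-in for the real rows (invented sample data).
sample = [
    {"series": "A", "volume": 1, "chapter": 1, "text": "Merhaba dünya"},
    {"series": "A", "volume": 1, "chapter": 1, "text": "Bu ikinci satır"},
    {"series": "A", "volume": 1, "chapter": 2, "text": "Yeni bölüm"},
]

stats = corpus_stats(sample)
chapters = group_by_chapter(sample)
```

With the real data, `corpus_stats(load_lines())` would iterate over the train split in the same way, provided the text column is in fact named `text`. Given the restricted license, any such use should stay within research settings.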

Provider:
soundstarrain