nenekochan/yoruno-vn
Source: Hugging Face · Updated 2024-04-08 · Indexed 2024-03-04
Download link: https://hf-mirror.com/datasets/nenekochan/yoruno-vn
Resource description:
---
pretty_name: 夜羊L系列脚本
language:
- zh
- ja
language_details: zho_Hans, jpn
license: cc-by-nc-4.0
annotations_creators:
- expert-generated
- machine-generated
task_categories:
- text-generation
tags:
- not-for-all-audiences
viewer: false
---
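> Nights when I can't sleep and nights when I don't want to sleep
## ⚠️ Notice
- **Please note that the data comes from R18 visual novels and contains themes that may be considered inappropriate, shocking, disturbing, offensive, or extreme. If you are unsure about the legal consequences of possessing any form of fictional written content in your country, do not download it.**
- **All data in this project, and any derivative works based on it, may not be used for commercial purposes.** I do not own the krkr2 script source files in `scenario-raw` and `scenario_ja-raw`; the remaining data processing methods are released under CC BY-NC 4.0.
- 🔑 The archives are encrypted; the extraction password is `yorunohitsuji`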
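For scripted extraction, the password can be passed to the `7z` command-line tool. The sketch below only assembles the command; it assumes a `7z` binary (7-Zip or p7zip) is available on `PATH`, and the helper name `build_extract_command` and the output directory are illustrative, not part of this project.

```python
import shlex

PASSWORD = "yorunohitsuji"  # extraction password stated in the notice


def build_extract_command(archive, out_dir, password=PASSWORD):
    """Return a 7z argument list that extracts `archive` into `out_dir`.

    Hypothetical helper: assumes the `7z` CLI is installed.
    """
    # `x` extracts with full paths; -p and -o take their values with no space.
    return ["7z", "x", archive, f"-p{password}", f"-o{out_dir}"]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_extract_command("yoruno-vn.7z", "yoruno-vn")))
```

To actually extract, the returned list can be handed to `subprocess.run(cmd, check=True)`.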
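## File structure
```
yoruno-vn.7z            # (zh)
├── scenario-raw/       # krkr2 script source files
├── scenario/           # cleaned, structured scripts
└── conversation/       # dialogue-format data built from my subjective segmentation
yoruno_ja-vn.7z         # (ja)
├── scenario_ja-raw/    # krkr2 script source files
├── scenario_ja/        # cleaned, structured scripts
└── sound_ja/           # (the voice files, which are not included, and) my hand-annotated classification metadata
```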
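The preprocessing commands in this card refer to the structured scripts as `.jsonl` files, so each one can presumably be read as one JSON object per line. A minimal sketch, assuming UTF-8 encoding; the record fields are not documented here, so the example only counts whatever top-level keys are present, and the file name in the usage comment is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def key_counts(records):
    """Count how often each top-level key appears, to discover the schema."""
    counts = Counter()
    for rec in records:
        if isinstance(rec, dict):
            counts.update(rec.keys())
    return counts


# Usage (hypothetical file name):
# print(key_counts(read_jsonl("scenario/some-title.jsonl")))
```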
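- As for the subjective segmentation: part of it is manual, and the rest is a not-very-reliable automatic segmentation based on text similarity (that is the part I haven't played through yet, and I don't want to be spoiled, aaah). Manual segmentation is a long and difficult road, so I'll take it slowly; progress is tracked in [manual_seg-progress.md](manual_seg-progress.md).
- The first four titles (2015-2017) have a single heroine; all later titles have two heroines, and their script format differs slightly.
- The subjectively segmented content leaves out some conversations with NPCs and fixes typos, so it does not match the original scripts exactly.
- I decided not to include the voice files after all, but I did annotate metadata that can be used to pick out voice clips containing panting and/or mouth sounds.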
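To illustrate what similarity-based segmentation can look like, here is a minimal sketch. It is not the project's `segment.auto`: it compares the character-bigram sets of the lines before and after each position using Jaccard similarity, and starts a new segment where the score falls below a threshold. The window size, the threshold, and all function names are assumptions chosen for illustration.

```python
def bigrams(text):
    """Return the set of character bigrams in `text` (no tokenizer needed for CJK)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def window_bigrams(lines):
    """Union of the bigram sets of several lines, without cross-line bigrams."""
    return set().union(*(bigrams(line) for line in lines))


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def segment(lines, window=2, threshold=0.1):
    """Split `lines` into segments at low-similarity boundaries.

    For each boundary i, the `window` lines before it are compared with the
    `window` lines after it; a score below `threshold` starts a new segment.
    """
    if not lines:
        return []
    segments, current = [], [lines[0]]
    for i in range(1, len(lines)):
        before = window_bigrams(lines[max(0, i - window):i])
        after = window_bigrams(lines[i:i + window])
        if jaccard(before, after) < threshold:
            segments.append(current)
            current = []
        current.append(lines[i])
    segments.append(current)
    return segments
```

A surface-overlap measure like this is cheap but crude, which is consistent with the caveat that automatic segments are less reliable than manual ones.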
## Preprocessing workflow (notes to myself)
0. Extract each title's scripts into `scenario[_ja]-raw/` and convert them to UTF-8 with `script/transcode.sh`; `2015-sssa` additionally needs `script/dos2unix.sh` to convert line endings to LF.
1. Fix minor formatting problems: `cd scenario-raw && bash patch.sh`
2. Run `python ks-parse-all.py --voice scenario[_ja]-raw/ scenario[_ja]/` to produce `scenario[_ja]/`.
3. Segment, then convert to `conversation/`:
   a. Automatic segmentation: `python -m segment.auto path/to/scenario.jsonl`
   b. After manual segmentation: `python -m segment.manual path/to/scenario-manual_seg.jsonl`
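Step 0 combines two byte-level conversions: re-encoding to UTF-8 and converting CRLF line endings to LF. The shell scripts themselves are not shown in this card, so the following is only a Python sketch of the same idea. The source encoding is an assumption that the caller must supply (`shift_jis` is used purely for illustration), and the function name is hypothetical.

```python
def to_utf8_lf(data: bytes, src_encoding: str) -> bytes:
    """Decode `data` from `src_encoding`, convert CRLF to LF, encode as UTF-8."""
    text = data.decode(src_encoding)  # raises UnicodeDecodeError on a wrong guess
    text = text.replace("\r\n", "\n")  # same effect as dos2unix on CRLF files
    return text.encode("utf-8")


if __name__ == "__main__":
    raw = "こんにちは\r\nテスト\r\n".encode("shift_jis")
    print(to_utf8_lf(raw, "shift_jis").decode("utf-8"), end="")
```

Decoding strictly, rather than with `errors="replace"`, makes a wrong encoding guess fail loudly instead of silently corrupting the script text.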
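Adding a new volume:
0. Put the scripts in `scenario[_ja]-raw/`
1. Add the new volume's metadata in `ks-parse-all.py`
## Acknowledgments
To 夜羊社, which has kept me company day and night, and to the many Chinese translation groups behind the data sources…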
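Provider:
nenekochan
Summary of original information
夜羊 L-series scripts dataset
Basic information
- Name: 夜羊L系列脚本 (夜羊 L-series scripts)
- Languages:
  - Chinese (zho_Hans)
  - Japanese (jpn)
- License: CC BY-NC 4.0
- Annotation creators:
  - expert-generated
  - machine-generated
- Task category: text generation
- Tag: not-for-all-audiences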
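Notes
- The data comes from R18 visual novels and contains themes that may be considered inappropriate, shocking, disturbing, offensive, or extreme.
- All data in this project, and any derivative works based on it, may not be used for commercial purposes.
- The archives are encrypted; the extraction password is `yorunohitsuji`.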
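File structure
- Chinese data: `yoruno-vn.7z`
  - `scenario-raw/`: krkr2 script source files
  - `scenario/`: cleaned, structured scripts
  - `conversation/`: dialogue-format data produced by subjective segmentation
- Japanese data: `yoruno_ja-vn.7z`
  - `scenario_ja-raw/`: krkr2 script source files
  - `scenario_ja/`: cleaned, structured scripts
  - `sound_ja/`: hand-annotated classification metadata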
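Segmentation notes
- Part of the subjective segmentation is manual; the rest is automatic segmentation based on text similarity.
- Manual segmentation progress is tracked in `manual_seg-progress.md`.
- The segmented content leaves out some conversations with NPCs and fixes typos, so it does not match the original scripts exactly.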
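Preprocessing workflow
- Extract the scripts and convert them to UTF-8.
- Fix minor formatting problems.
- Run the scripts to parse and segment.
- Add metadata for new volumes.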
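Dataset introduction
Background overview
This dataset, nenekochan/yoruno-vn, is intended mainly for text generation and covers Chinese and Japanese. It is flagged as containing sensitive content that is not suitable for all audiences, and it is released under the CC BY-NC 4.0 license.
The above content was collected and summarized by 遇见数据集.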