nenekochan/yoruno-vn

Name: nenekochan/yoruno-vn
Creator: nenekochan
Published: 2024-04-08 04:04:38
License: No description available

Source: Hugging Face. Updated: 2024-04-08. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/nenekochan/yoruno-vn

Resource overview:

---
pretty_name: 夜羊L系列脚本
language:
- zh
- ja
language_details: zho_Hans, jpn
license: cc-by-nc-4.0
annotations_creators:
- expert-generated
- machine-generated
task_categories:
- text-generation
tags:
- not-for-all-audiences
viewer: false
---

> Nights when I can't sleep and nights when I don't want to sleep

## ⚠️ Notice

- **Please note that this data comes from R18 visual novels and contains themes that may be considered inappropriate, shocking, disturbing, offensive, or extreme. If you are unsure about the legal consequences of possessing any form of fictional written content in your country, please do not download it.**
- **All data in this project, and any derivative works based on it, may not be used for commercial purposes.** I do not own the krkr2 script source files in `scenario-raw` and `scenario_ja-raw`; the remaining data processing methods are released under CC BY-NC 4.0.
- 🔑 The archives are encrypted. The extraction password is `yorunohitsuji`.

## File Structure

```
yoruno-vn.7z            # (zh)
├── scenario-raw/       # krkr2 script source files
├── scenario/           # cleaned, structured scripts
└── conversation/       # conversation-format data from my own subjective segmentation

yoruno_ja-vn.7z         # (ja)
├── scenario_ja-raw/    # krkr2 script source files
├── scenario_ja/        # cleaned, structured scripts
└── sound_ja/           # (the audio itself is not included) my hand-annotated classification metadata
```

- The subjective segmentation is partly manual. The rest is a not very reliable automatic segmentation based on text similarity (this covers the works I have not played through yet, because I don't want to be spoiled). Manual segmentation is a long and arduous road, so I am working through it gradually; progress is tracked in [manual_seg-progress.md](manual_seg-progress.md).
- The first four works (2015-2017) have a single heroine. All later works have two heroines, and their script format differs slightly.
- The segmented content leaves out some dialogue with NPCs and corrects typos, so it does not match the original scripts exactly.
- I decided not to include the voice files after all. I did annotate metadata, which is used to pick out voice clips containing panting and/or mouth sounds.

## Preprocessing Workflow (Notes to Myself)

0. Extract each work's scripts into `scenario[_ja]-raw/` and convert them to UTF-8 with `script/transcode.sh`. `2015-sssa` additionally needs `script/dos2unix.sh` to convert its line endings to LF.
1. Fix minor formatting problems: `cd scenario-raw && bash patch.sh`
2. Run `python ks-parse-all.py --voice scenario[_ja]-raw/ scenario[_ja]/` to produce `scenario[_ja]/`.
3. Segment the scripts, then convert them to `conversation/`:
   a. Automatic segmentation: `python -m segment.auto path/to/scenario.jsonl`
   b. After manual segmentation: `python -m segment.manual path/to/scenario-manual_seg.jsonl`

Adding a new volume:

0. Put the scripts in `scenario[_ja]-raw/`.
1. Add the new volume's metadata in `ks-parse-all.py`.

## Acknowledgments

To 夜羊社 (Yeyang Studio), which has kept me company day and night, and to the many Chinese translation groups behind the data sources...
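As a rough illustration of what "automatic segmentation based on text similarity" can mean, the sketch below starts a new segment whenever two adjacent lines fall below a similarity threshold. It is not the project's `segment.auto` module, whose actual method is not described here. The function names, the use of `difflib`, and the default threshold are all assumptions made for this example.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1], using the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def segment_by_similarity(lines: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Group consecutive lines; open a new segment when adjacent lines
    are less similar than `threshold` (hypothetical default value)."""
    segments: list[list[str]] = []
    for line in lines:
        # Compare the incoming line with the last line of the current segment.
        if segments and similarity(segments[-1][-1], line) >= threshold:
            segments[-1].append(line)
        else:
            segments.append([line])
    return segments
```

Character-level similarity is a crude signal for dialogue boundaries, which fits the README's own caveat that the automatic results are not very reliable and that manual segmentation is preferred where available.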

Provider:

nenekochan

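Summary of Original Information

夜羊L系列脚本 (Yeyang L Series Scripts) Dataset

Basic Information

Name: 夜羊L系列脚本 (Yeyang L Series Scripts)
Languages:
- Chinese (zho_Hans)
- Japanese (jpn)
License: CC BY-NC 4.0
Annotation creators:
- expert-generated
- machine-generated
Task category: text generation
Tags: not-for-all-audiences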

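Notes

The data comes from R18 visual novels and contains themes that may be considered inappropriate, shocking, disturbing, offensive, or extreme.
All data in this project, and any derivative works based on it, may not be used for commercial purposes.
The archives are encrypted. The extraction password is yorunohitsuji.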

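File Structure

Chinese data:
- yoruno-vn.7z
  - scenario-raw/: krkr2 script source files
  - scenario/: cleaned, structured scripts
  - conversation/: conversation-format data produced by subjective segmentation
Japanese data:
- yoruno_ja-vn.7z
  - scenario_ja-raw/: krkr2 script source files
  - scenario_ja/: cleaned, structured scripts
  - sound_ja/: hand-annotated classification metadata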
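The dataset card refers to structured scripts by paths such as `path/to/scenario.jsonl`, which suggests the cleaned scripts are stored as JSON Lines. A minimal sketch for loading such a file is shown below. It assumes one JSON object per line in UTF-8 and makes no assumption about field names, since the record schema is not documented here. The helper names `parse_jsonl` and `read_jsonl` are made up for this example.

```python
import json
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


def read_jsonl(path: str) -> list[dict]:
    """Load every record from a UTF-8 JSONL file (assumed encoding)."""
    with open(path, encoding="utf-8") as f:
        return list(parse_jsonl(f))
```

For example, `records = read_jsonl("path/to/scenario.jsonl")` followed by `print(records[0].keys())` is a quick way to inspect which fields a record actually carries after extracting the archive.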