OpenSakura/OpenSakura-DS-260220-LN-ja-zh-ALIGNED-Eve
Source: Hugging Face. Updated 2026-03-01; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/OpenSakura/OpenSakura-DS-260220-LN-ja-zh-ALIGNED-Eve
Resource description:
---
pretty_name: OpenSakura Eve LN Aligned Dataset
license: other
language:
- ja
- zh
language_bcp47:
- ja
- zh-Hans
- zh-Hant
task_categories:
- translation
multilinguality: translation
annotations_creators:
- machine-generated
language_creators:
- found
size_categories:
- 100K<n<1M
tags:
- opensakura
- translation
- light-novel
- aligned
- ja
- zh-hans
- zh-hant
- eve
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: arena
    path: data/arena-*.parquet
  - split: reserve
    path: data/reserve-*.parquet
  - split: validation
    path: data/validation-*.parquet
  - split: test
    path: data/test-*.parquet
---
# OpenSakura Eve LN Aligned Dataset
`OpenSakura-DS-260220-LN-ja-zh-ALIGNED-Eve` is a Japanese-to-Chinese light-novel translation dataset in OpenSakura ALIGNED format.
Stats below are computed from the actual generated parquet files.
## Dataset Summary
| Metric | Value |
|---|---|
| Dataset ID | `OpenSakura/OpenSakura-DS-260220-LN-ja-zh-ALIGNED-Eve` |
| Total rows | 631,009 |
| Total parquet files | 213 (`train`: 148, `validation`: 22, `test`: 21; `arena` and `reserve` counts not listed) |
| Total size | 7,531,530,066 bytes (~7.53 GB, ~7.01 GiB) |
| Source language | `ja` |
| Target language | `zh*` |
| Domain | Light Novel (LN) |
| Training type | ALIGNED |
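As a quick sanity check on the size figures in the table, the byte count can be converted to decimal gigabytes and binary gibibytes. This is only a sketch of the arithmetic; the helper names are illustrative and not part of the dataset tooling. It also sums the per-split file counts that the table lists, which shows how many of the 213 files are left unaccounted for (assuming the total covers all five splits).

```python
# Figures copied from the Dataset Summary table.
TOTAL_BYTES = 7_531_530_066
TOTAL_FILES = 213
LISTED_FILES = {"train": 148, "validation": 22, "test": 21}


def to_gb(num_bytes: int) -> float:
    """Decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 10**9


def to_gib(num_bytes: int) -> float:
    """Binary gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


# Files not covered by the per-split counts in the table.
unlisted_files = TOTAL_FILES - sum(LISTED_FILES.values())

print(f"{to_gb(TOTAL_BYTES):.2f} GB, {to_gib(TOTAL_BYTES):.2f} GiB")
print(f"files without a listed split: {unlisted_files}")
```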
## Split Information
| Split | Rows | Share |
|---|---:|---:|
| train | 441,537 | 69.97% |
| arena | 31,629 | 5.01% |
| reserve | 31,806 | 5.04% |
| validation | 63,141 | 10.01% |
| test | 62,896 | 9.97% |
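The shares in the table follow directly from the row counts. A minimal sketch of that calculation, using only the numbers above (the variable names are illustrative):

```python
# Row counts copied from the Split Information table.
SPLIT_ROWS = {
    "train": 441_537,
    "arena": 31_629,
    "reserve": 31_806,
    "validation": 63_141,
    "test": 62_896,
}

total_rows = sum(SPLIT_ROWS.values())

# Each split's share of the total, as a percentage rounded to two decimals.
shares = {name: round(100 * rows / total_rows, 2) for name, rows in SPLIT_ROWS.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```

The row counts sum to the 631,009 total rows reported in the Dataset Summary.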
## Public Schema
Each row contains:
- `uuid` (`string`)
- `source_text` (`string`)
- `target_text` (`string`)
- `source_lang` (`string`)
- `target_lang` (`string`)
- `glossary_uuid` (`string | null`)
- `glossary` (`list[struct{term, translation, comment}]`)
- `gen_model`, `gen_frequency_penalty`, `gen_max_tokens`, `gen_temperature`, `gen_top_p`
- `input_tokens_count` (`int64`), `output_tokens_count` (`int64`)
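To illustrate how the `glossary` field might be consumed, the sketch below renders a row's glossary entries as one line per term. It assumes that `term`, `translation`, and `comment` are strings, that `comment` may be empty or null, and that `glossary` itself may be empty or null; none of this is stated in the schema above. `GlossaryEntry` and `format_glossary` are hypothetical names.

```python
from typing import List, Optional, TypedDict


class GlossaryEntry(TypedDict):
    # Field names come from the schema; the string types are an assumption.
    term: str
    translation: str
    comment: Optional[str]


def format_glossary(entries: Optional[List[GlossaryEntry]]) -> str:
    """Render glossary entries as 'term -> translation (comment)' lines."""
    lines = []
    for entry in entries or []:  # tolerate a null or empty glossary
        line = f"{entry['term']} -> {entry['translation']}"
        if entry.get("comment"):
            line += f" ({entry['comment']})"
        lines.append(line)
    return "\n".join(lines)


# Synthetic entries for illustration only; not taken from the dataset.
example = [
    {"term": "桜", "translation": "樱", "comment": "character name"},
    {"term": "夜", "translation": "夜", "comment": None},
]
print(format_glossary(example))
```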
## Tokenization and Generation Metadata
The following fields are included to make downstream filtering/debugging easier:
- Token counts are computed with tokenizer model `zai-org/GLM-4.7`.
- Generation metadata for this release:
  - `gen_model`: `GLM-4.7`
  - `gen_max_tokens`: `10240`
  - `gen_temperature`: `0.6`
  - `gen_top_p`: `0.95`
  - `gen_frequency_penalty`: `0.0`
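One way these fields could support downstream filtering is sketched below. It assumes `gen_max_tokens` is stored as an integer, and it treats an output that reaches the generation limit as possibly truncated; that heuristic is an assumption, not something this card states. Both predicate names are hypothetical.

```python
def is_possibly_truncated(row: dict) -> bool:
    """Heuristic: the output used up the whole generation budget."""
    return row["output_tokens_count"] >= row["gen_max_tokens"]


def within_budget(row: dict, max_total_tokens: int) -> bool:
    """True if input plus output tokens fit a caller-chosen budget."""
    total = row["input_tokens_count"] + row["output_tokens_count"]
    return total <= max_total_tokens


# Synthetic rows for illustration only.
rows = [
    {"input_tokens_count": 900, "output_tokens_count": 1100, "gen_max_tokens": 10240},
    {"input_tokens_count": 4000, "output_tokens_count": 10240, "gen_max_tokens": 10240},
]

kept = [
    row for row in rows
    if not is_possibly_truncated(row) and within_budget(row, max_total_tokens=4096)
]
print(len(kept))
```

Predicates of this shape take a single row as a dict, so they can also be passed to `Dataset.filter` from the `datasets` library.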
## Usage
```python
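from datasets import load_dataset

# Load every split at once; the result is a DatasetDict keyed by split name.
dataset = load_dataset("OpenSakura/OpenSakura-DS-260220-LN-ja-zh-ALIGNED-Eve")

# Or load a single split directly as a Dataset.
train = load_dataset("OpenSakura/OpenSakura-DS-260220-LN-ja-zh-ALIGNED-Eve", split="train")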
```
## Acknowledgments
This dataset would not exist without the generous support and contributions of the following individuals:
- **@ixgbe** -- Professional guidance and hands-on help with Kubernetes infrastructure that powers the OpenSakura pipeline.
- **An anonymous group member** -- Generous sponsorship of H200 compute nodes. Without their support, OpenSakura could not have been founded and this dataset could not have been produced.
- **@Josepha** -- Invaluable contributions to dataset cleaning and processing techniques that shaped the quality of this release.
- **@lildub** -- LLM API sponsorship during the early experimental stage of the project, enabling the initial research and prototyping.
- **[@neavo](https://github.com/neavo)** -- Inspiration and solid implementation of the [LinguaGacha](https://github.com/neavo/LinguaGacha) open-source project, which informed the design of the translation API used in this project.
## Limitations and Intended Use
- Machine-generated translations may contain errors, inconsistencies, or noise.
- Target Chinese text may mix Simplified and Traditional script variants (`zh-Hans`, `zh-Hant`), depending on upstream language tags.
- Light-novel content may include mature/sensitive text (violence/sexual content/profanity).
- Intended for research/model development; evaluate upstream rights and redistribution constraints before commercial use.
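Because the target side may mix script variants, it can help to check the distribution of `target_lang` values before training and to keep only the variant you need. The sketch below works on any iterable of row dicts. The tag values in the sample rows are hypothetical, since this card does not list the exact strings stored in `target_lang`.

```python
from collections import Counter
from typing import Iterable, List


def count_target_langs(rows: Iterable[dict]) -> Counter:
    """Count how many rows carry each `target_lang` value."""
    return Counter(row["target_lang"] for row in rows)


def keep_target_lang(rows: Iterable[dict], wanted: str) -> List[dict]:
    """Keep only rows whose `target_lang` equals the wanted tag."""
    return [row for row in rows if row["target_lang"] == wanted]


# Synthetic rows with assumed tag values, for illustration only.
sample = [
    {"uuid": "a", "target_lang": "zh-Hans"},
    {"uuid": "b", "target_lang": "zh-Hant"},
    {"uuid": "c", "target_lang": "zh-Hans"},
]

print(count_target_langs(sample))
print([row["uuid"] for row in keep_target_lang(sample, "zh-Hans")])
```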
Provider:
OpenSakura



