
OpenSakura/OpenSakura-DS-260220-LN-ja-zh-ALIGNED-Eve

Hugging Face | Updated 2026-03-01 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/OpenSakura/OpenSakura-DS-260220-LN-ja-zh-ALIGNED-Eve
Resource description:
---
pretty_name: OpenSakura Eve LN Aligned Dataset
license: other
language:
- ja
- zh
language_bcp47:
- ja
- zh-Hans
- zh-Hant
task_categories:
- translation
multilinguality: translation
annotations_creators:
- machine-generated
language_creators:
- found
size_categories:
- 100K<n<1M
tags:
- opensakura
- translation
- light-novel
- aligned
- ja
- zh-hans
- zh-hant
- eve
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: arena
    path: data/arena-*.parquet
  - split: reserve
    path: data/reserve-*.parquet
  - split: validation
    path: data/validation-*.parquet
  - split: test
    path: data/test-*.parquet
---

# OpenSakura Eve LN Aligned Dataset

`OpenSakura-DS-260220-LN-ja-zh-ALIGNED-Eve` is a Japanese-to-Chinese light-novel translation dataset in OpenSakura ALIGNED format.

Stats below are computed from the actual generated parquet files.

## Dataset Summary

| Metric | Value |
|---|---|
| Dataset ID | `OpenSakura/OpenSakura-DS-260220-LN-ja-zh-ALIGNED-Eve` |
| Total rows | 631,009 |
| Total parquet files | 213 (`train`: 148, `validation`: 22, `test`: 21) |
| Total size | 7,531,530,066 bytes (~7.53 GB, ~7.01 GiB) |
| Source language | `ja` |
| Target language | `zh*` |
| Domain | Light Novel (LN) |
| Training type | ALIGNED |

## Split Information

| Split | Rows | Share |
|---|---:|---:|
| train | 441,537 | 69.97% |
| arena | 31,629 | 5.01% |
| reserve | 31,806 | 5.04% |
| validation | 63,141 | 10.01% |
| test | 62,896 | 9.97% |

## Public Schema

Each row contains:

- `uuid` (`string`)
- `source_text` (`string`)
- `target_text` (`string`)
- `source_lang` (`string`)
- `target_lang` (`string`)
- `glossary_uuid` (`string | null`)
- `glossary` (`list[struct{term, translation, comment}]`)
- `gen_model`, `gen_frequency_penalty`, `gen_max_tokens`, `gen_temperature`, `gen_top_p`
- `input_tokens_count` (`int64`), `output_tokens_count` (`int64`)

## Tokenization and Generation Metadata

The following fields are included to make downstream filtering/debugging easier:

- Token counts are computed with tokenizer model `zai-org/GLM-4.7`.
- Generation metadata for this release:
  - `gen_model`: `GLM-4.7`
  - `gen_max_tokens`: `10240`
  - `gen_temperature`: `0.6`
  - `gen_top_p`: `0.95`
  - `gen_frequency_penalty`: `0.0`

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("OpenSakura/OpenSakura-DS-260220-LN-ja-zh-ALIGNED-Eve")
train = load_dataset("OpenSakura/OpenSakura-DS-260220-LN-ja-zh-ALIGNED-Eve", split="train")
```

## Acknowledgments

This dataset would not exist without the generous support and contributions of the following individuals:

- **@ixgbe** -- Professional guidance and hands-on help with Kubernetes infrastructure that powers the OpenSakura pipeline.
- **An anonymous group member** -- Generous sponsorship of H200 compute nodes. Without their support, OpenSakura could not have been founded and this dataset could not have been produced.
- **@Josepha** -- Invaluable contributions to dataset cleaning and processing techniques that shaped the quality of this release.
- **@lildub** -- LLM API sponsorship during the early experimental stage of the project, enabling the initial research and prototyping.
- **[@neavo](https://github.com/neavo)** -- Inspiration and solid implementation of the [LinguaGacha](https://github.com/neavo/LinguaGacha) open-source project, which informed the design of the translation API used in this project.

## Limitations and Intended Use

- Machine-generated translations may contain errors, inconsistencies, or noise.
- Target Chinese may contain mixed script variants depending on upstream language tags.
- Light-novel content may include mature/sensitive text (violence/sexual content/profanity).
- Intended for research/model development; evaluate upstream rights and redistribution constraints before commercial use.
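
## Example: Filtering by Token Count and Reading the Glossary

The schema above exposes `input_tokens_count`, `output_tokens_count`, and a `glossary` list of `term`/`translation`/`comment` structs. The following is a minimal sketch of how those fields might be used for downstream filtering. The helper names (`within_token_budget`, `glossary_lines`), the 4096-token budget, and the sample row are all hypothetical illustrations and are not part of the dataset or the OpenSakura tooling. It also assumes each glossary entry arrives as a dict, which may vary by `datasets` version.

```python
from typing import Any, List, Mapping


def within_token_budget(
    row: Mapping[str, Any],
    max_input: int = 4096,   # hypothetical budget, not a dataset default
    max_output: int = 4096,  # hypothetical budget, not a dataset default
) -> bool:
    """Return True when both token counts fit within the given budget."""
    return (
        row["input_tokens_count"] <= max_input
        and row["output_tokens_count"] <= max_output
    )


def glossary_lines(row: Mapping[str, Any]) -> List[str]:
    """Render glossary entries as 'term -> translation (comment)' strings."""
    lines = []
    # The glossary may be empty or missing, so fall back to an empty list.
    for entry in row.get("glossary") or []:
        line = f'{entry["term"]} -> {entry["translation"]}'
        if entry.get("comment"):
            line += f' ({entry["comment"]})'
        lines.append(line)
    return lines


# Hand-written sample row for illustration only; NOT taken from the dataset.
sample = {
    "source_text": "エヴァは笑った。",
    "target_text": "伊娃笑了。",
    "glossary": [
        {"term": "エヴァ", "translation": "伊娃", "comment": "人名"},
        {"term": "魔法", "translation": "魔法", "comment": None},
    ],
    "input_tokens_count": 12,
    "output_tokens_count": 8,
}

print(within_token_budget(sample))
print(glossary_lines(sample))
```

With the `datasets` library, the same predicate could likely be passed straight to `Dataset.filter`, for example `train.filter(within_token_budget)`, since `filter` calls the function once per row with a dict-like example.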
Provider:
OpenSakura