OpenSakura/OpenSakura-DS-260220-LN-ja-zh-PT-Adam
Hugging Face | Updated 2026-03-01 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/OpenSakura/OpenSakura-DS-260220-LN-ja-zh-PT-Adam
Description:
---
pretty_name: "OpenSakura Adam LN Pretrain Dataset"
license: other
language:
- ja
- zh
language_bcp47:
- ja
- zh-Hans
- zh-Hant
task_categories:
- text-generation
multilinguality: multilingual
annotations_creators:
- no-annotation
language_creators:
- found
size_categories:
- n>1M
tags:
- opensakura
- pretrain
- light-novel
- multilingual
- ja
- zh-hans
- zh-hant
- pt
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: arena
    path: data/arena-*.parquet
  - split: reserve
    path: data/reserve-*.parquet
  - split: validation
    path: data/validation-*.parquet
  - split: test
    path: data/test-*.parquet
---
# OpenSakura Adam LN Pretrain Dataset
`OpenSakura-DS-260220-LN-ja-zh-PT-Adam` is a large-scale pretraining corpus built from light-novel source shards and filtered to Japanese and Chinese text (`ja`, `zh-Hans`, `zh-Hant`).
This export is intended for pretraining (PT) and continued pretraining (CPT) of language models.
## Dataset Summary
| Metric | Value |
|---|---|
| Dataset ID | `OpenSakura/OpenSakura-DS-260220-LN-ja-zh-PT-Adam` |
| Total rows | 9,515,512 |
| Total parquet files | 480 |
| Total size | 63,621,025,693 bytes (~63.62 GB, ~59.25 GiB) |
| Languages (BCP-47) | `ja`, `zh-Hans`, `zh-Hant` |
| Domain | Light Novel (LN) |
| Training type | PT (pretrain) |
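
The size figures in the table can be re-derived from the byte count. The following is a minimal sketch that uses only numbers from the table above; the per-file and per-row averages are derived here for illustration and are not figures stated by the dataset authors.

```python
# Values copied from the Dataset Summary table.
TOTAL_BYTES = 63_621_025_693
TOTAL_ROWS = 9_515_512
TOTAL_FILES = 480

size_gb = TOTAL_BYTES / 10**9    # decimal gigabytes (1 GB = 10^9 bytes)
size_gib = TOTAL_BYTES / 2**30   # binary gibibytes (1 GiB = 2^30 bytes)

# Derived averages (not stated in the card): useful for sizing downloads.
avg_bytes_per_file = TOTAL_BYTES / TOTAL_FILES
avg_bytes_per_row = TOTAL_BYTES / TOTAL_ROWS

print(f"{size_gb:.2f} GB, {size_gib:.2f} GiB")
print(f"{avg_bytes_per_file / 10**6:.1f} MB per parquet file on average")
print(f"{avg_bytes_per_row:.0f} bytes per row on average")
```

The averages are rough planning numbers only; individual shards and rows can differ substantially from the mean.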
## Split Information
| Split | Rows | Share |
|---|---:|---:|
| train | 6,662,783 | 70.02% |
| arena | 474,135 | 4.98% |
| reserve | 475,823 | 5.00% |
| validation | 952,813 | 10.01% |
| test | 949,958 | 9.98% |
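
The shares in the table follow directly from the row counts. A short sketch that recomputes them and checks that the splits add up to the total row count reported in the summary:

```python
# Row counts copied from the Split Information table.
SPLIT_ROWS = {
    "train": 6_662_783,
    "arena": 474_135,
    "reserve": 475_823,
    "validation": 952_813,
    "test": 949_958,
}
TOTAL_ROWS = 9_515_512  # from the Dataset Summary table

# The five splits should partition the dataset exactly.
assert sum(SPLIT_ROWS.values()) == TOTAL_ROWS

# Percentage share of each split, rounded to two decimals as in the table.
shares = {name: round(100 * rows / TOTAL_ROWS, 2) for name, rows in SPLIT_ROWS.items()}

for name, share in shares.items():
    print(f"{name:<10} {share:5.2f}%")
```

Because each share is rounded independently, the rounded percentages sum to 99.99% rather than exactly 100%.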
## Public Schema
Each row contains:
- `uuid` (`string`)
- `text` (`string`)
- `lang` (`string`)
- `token_count` (`int64`)
- `glossary_id` (`string`)
- `glossary` (`list[struct{term, translation, comment}]`)
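
To make the schema concrete, here is a minimal sketch of a row check in plain Python. The mapping from column types to Python types, the assumption that no field is null, and every value in `sample_row` are illustrative assumptions, not taken from the dataset.

```python
from typing import Any

# Python types assumed to correspond to the column types listed above.
EXPECTED_FIELDS = {
    "uuid": str,
    "text": str,
    "lang": str,
    "token_count": int,
    "glossary_id": str,
    "glossary": list,
}
GLOSSARY_KEYS = {"term", "translation", "comment"}


def schema_problems(row: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the row matches."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    glossary = row.get("glossary")
    if isinstance(glossary, list):
        for i, entry in enumerate(glossary):
            # Each glossary entry is expected to be a struct with exactly these keys.
            if not isinstance(entry, dict) or set(entry) != GLOSSARY_KEYS:
                problems.append(f"glossary[{i}]: expected keys {sorted(GLOSSARY_KEYS)}")
    return problems


# Hypothetical row; none of these values come from the dataset.
sample_row = {
    "uuid": "00000000-0000-0000-0000-000000000000",
    "text": "これは例文です。",
    "lang": "ja",
    "token_count": 7,
    "glossary_id": "example-glossary",
    "glossary": [{"term": "例文", "translation": "例句", "comment": ""}],
}

print(schema_problems(sample_row))
```

If the real data contains null values in `glossary_id` or `glossary`, this check would report them as type mismatches and would need to be relaxed.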
## Usage
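```python
from datasets import load_dataset

# Load all splits (train, arena, reserve, validation, test) as a DatasetDict.
dataset = load_dataset("OpenSakura/OpenSakura-DS-260220-LN-ja-zh-PT-Adam")
# Load only the train split.
train = load_dataset("OpenSakura/OpenSakura-DS-260220-LN-ja-zh-PT-Adam", split="train")
```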
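Given the size of the corpus, it may be more practical to stream a split than to download it in full. The sketch below uses the `streaming=True` option of `load_dataset` and sums `token_count` per `lang` over the first rows of a split. The helper names `stream_split` and `token_totals_by_lang` are hypothetical, and the code assumes that `lang` holds tags such as `ja` or `zh-Hans`.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Mapping, Optional

DATASET_ID = "OpenSakura/OpenSakura-DS-260220-LN-ja-zh-PT-Adam"


def stream_split(split: str = "train"):
    """Return an iterable over one split without downloading it in full."""
    # Imported lazily so token_totals_by_lang can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split, streaming=True)


def token_totals_by_lang(rows: Iterable[Mapping], limit: Optional[int] = None) -> Counter:
    """Sum `token_count` per `lang` over the first `limit` rows (all rows if None)."""
    totals: Counter = Counter()
    for row in islice(rows, limit):
        totals[row["lang"]] += row["token_count"]
    return totals


# Example (requires network access):
# print(token_totals_by_lang(stream_split("validation"), limit=1000))
```

The counting helper accepts any iterable of row-like mappings, so it can be tried on a few hand-written rows before pointing it at the streamed split.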
## Limitations and Intended Use
- This is a **pretraining corpus** (`text` + `lang`), not aligned translation pairs.
- Language filtering/detection can still make mistakes on short, noisy, or mixed-script lines.
- Light-novel content may include mature/sensitive text (violence/sexual content/profanity).
- Intended for research/model development; evaluate upstream rights and redistribution constraints before commercial use.
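
Since language labels can be wrong on short or mixed-script lines, a downstream user might want a cheap sanity check before training. The following is one possible heuristic, not the filter used to build this dataset: it measures the fraction of hiragana and katakana characters and flags rows whose `lang` label disagrees with it. The function names, the 5% threshold, and the assumption that `lang` is `ja` for Japanese rows are all illustrative.

```python
from typing import Mapping


def kana_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Hiragana or Katakana Unicode blocks."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # U+3040..U+309F is Hiragana, U+30A0..U+30FF is Katakana.
    kana = sum(1 for c in chars if "\u3040" <= c <= "\u30ff")
    return kana / len(chars)


def looks_mislabeled(row: Mapping, threshold: float = 0.05) -> bool:
    """Flag rows whose kana ratio disagrees with the `lang` label (threshold is a guess)."""
    ratio = kana_ratio(row["text"])
    if row["lang"] == "ja":
        # Japanese prose with almost no kana is unusual.
        return ratio < threshold
    # Kana in a row labeled as Chinese suggests Japanese or mixed text.
    return ratio >= threshold


# Hypothetical rows for illustration only.
print(looks_mislabeled({"text": "これは日本語の文章です。", "lang": "zh-Hans"}))
print(looks_mislabeled({"text": "这是一个中文句子。", "lang": "zh-Hans"}))
```

A check like this will also flag short kanji-only Japanese lines, so flagged rows are better treated as candidates for review than as confirmed errors. It cannot distinguish `zh-Hans` from `zh-Hant`.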
Provider:
OpenSakura



