bangboom/chinese-translation-data
Source: Hugging Face. Updated 2026-04-03; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/bangboom/chinese-translation-data
Resource description:
---
license: apache-2.0
task_categories:
- translation
language:
- zh
tags:
- chinese
- classical-chinese
- machine-translation
- fairseq
size_categories:
- 100K<n<1M
---
# Chinese Translation Data (WebTrans)
This dataset contains training data for four Chinese text-to-text tasks: classical-to-modern Chinese translation, modern-to-classical Chinese translation, punctuation addition, and poetry translation. The data is already binarized in fairseq format (`.bin`/`.idx` files plus vocabulary files), so it is ready for training sequence-to-sequence models.
## Dataset Structure
```
├── bpe_file/ # BPE model files
│   ├── addpunc.bpe # BPE codes for punctuation task
│   ├── new.bpe # BPE codes for modern Chinese
│   ├── old.bpe # BPE codes for classical Chinese
│   └── shici.bpe # BPE codes for poetry
├── dataflow_addpunc/ # Binarized data for punctuation addition
│   ├── dict.src.txt # Source vocabulary
│   ├── dict.tgt.txt # Target vocabulary
│   └── {train,valid,test}.src-tgt.{src,tgt}.{bin,idx}
├── dataflow_new2old/ # Binarized data: modern → classical Chinese
│   ├── dict.new.txt / dict.old.txt
│   └── {train,valid,test}.new-old.{new,old}.{bin,idx}
├── dataflow_old2new/ # Binarized data: classical → modern Chinese
│   ├── dict.new.txt / dict.old.txt
│   ├── tag
│   └── {train,valid,test}.old-new.{new,old}.{bin,idx}
├── dataflow_shici/ # Binarized data: poetry translation
│   ├── dict.key.txt / dict.sen.txt
│   └── {train,valid,test}.key-sen.{key,sen}.{bin,idx}
└── dictionary/ # Chinese dictionaries
    ├── dict.txt
    ├── 中国历史地名词典.txt # Chinese historical place names
    ├── 古代人名(25w).txt # Ancient person names (250K)
    └── 成语(5W).txt # Chinese idioms (50K)
```
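The file-name pattern in the tree can be expanded mechanically, which is handy for checking that a download is complete before training. The following is a minimal sketch: `expected_files` and `check_task_dir` are hypothetical helper names, not part of the dataset or of fairseq, and the sketch assumes each task directory contains one `dict.LANG.txt` per side plus the twelve `.bin`/`.idx` files implied by the pattern.

```bash
# Sketch: list the files a task directory should contain, following the
# {train,valid,test}.SRC-TGT.{SRC,TGT}.{bin,idx} pattern from the tree.
# expected_files and check_task_dir are hypothetical helper names.
expected_files() {
  local src="$1" tgt="$2"
  local split lang ext
  # One vocabulary file per side.
  echo "dict.${src}.txt"
  echo "dict.${tgt}.txt"
  # One .bin/.idx pair per split and per side.
  for split in train valid test; do
    for lang in "$src" "$tgt"; do
      for ext in bin idx; do
        echo "${split}.${src}-${tgt}.${lang}.${ext}"
      done
    done
  done
}

# Print every expected file that is missing from DIR; return 1 if any is.
check_task_dir() {
  local dir="$1" src="$2" tgt="$3"
  local f status=0
  for f in $(expected_files "$src" "$tgt"); do
    if [ ! -f "${dir}/${f}" ]; then
      echo "missing: ${dir}/${f}"
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_task_dir dataflow_old2new old new` prints nothing and returns 0 when all fourteen files are present. Extra files, such as `tag` in `dataflow_old2new`, are ignored.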
## Tasks
| Directory | Task | Source → Target |
|---|---|---|
| `dataflow_new2old` | Modern to Classical Chinese | new → old |
| `dataflow_old2new` | Classical to Modern Chinese | old → new |
| `dataflow_addpunc` | Punctuation Addition | src → tgt |
| `dataflow_shici` | Poetry Translation | key → sen |
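When scripting over several tasks, it can help to keep the table in one place as a lookup. The sketch below encodes exactly the four rows of the table; `lang_pair` is a hypothetical helper name introduced here for illustration.

```bash
# Sketch: map a task directory to its "source target" language codes,
# mirroring the Tasks table. lang_pair is a hypothetical helper name.
lang_pair() {
  case "$1" in
    dataflow_new2old) echo "new old" ;;
    dataflow_old2new) echo "old new" ;;
    dataflow_addpunc) echo "src tgt" ;;
    dataflow_shici)   echo "key sen" ;;
    *)
      echo "unknown task directory: $1" >&2
      return 1
      ;;
  esac
}

lang_pair dataflow_shici   # prints: key sen
```

The two codes are the same suffixes that appear in each directory's `dict.*.txt` and `.bin`/`.idx` file names.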
## Usage
The data is already binarized, so a task directory can be passed directly to `fairseq-train`. For example, to train a classical-to-modern Chinese model:
```bash
fairseq-train dataflow_old2new \
--save-dir checkpoints \
--arch transformer_iwslt_de_en \
--optimizer adam \
--lr 0.0005 \
--max-tokens 4096
```
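The same command can be adapted to the other task directories. The sketch below prints, rather than runs, a `fairseq-train` command with the direction spelled out through fairseq's `--source-lang` and `--target-lang` options instead of leaving it to be inferred from the file names. `train_cmd` is a hypothetical helper, the hyperparameters are copied from the example above, and the per-task default checkpoint directory is an assumption made here to keep runs from overwriting each other.

```bash
# Sketch: print (not run) a fairseq-train command for one task directory.
# train_cmd is a hypothetical helper; hyperparameters are copied from the
# README example, and the default --save-dir naming is an assumption.
train_cmd() {
  local dir="$1" src="$2" tgt="$3"
  local save_dir="${4:-checkpoints/${src}2${tgt}}"
  local cmd="fairseq-train ${dir} --task translation"
  cmd="${cmd} --source-lang ${src} --target-lang ${tgt}"
  cmd="${cmd} --save-dir ${save_dir}"
  cmd="${cmd} --arch transformer_iwslt_de_en --optimizer adam --lr 0.0005"
  cmd="${cmd} --max-tokens 4096"
  echo "$cmd"
}

train_cmd dataflow_shici key sen
```

Printing the command first makes it easy to inspect. Once fairseq is installed, `train_cmd dataflow_shici key sen | sh` would run it.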
## License
Apache-2.0
Provider:
bangboom



