bangboom/chinese-translation-data

Source: Hugging Face. Updated 2026-04-03. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/bangboom/chinese-translation-data
Description:
---
license: apache-2.0
task_categories:
- translation
language:
- zh
tags:
- chinese
- classical-chinese
- machine-translation
- fairseq
size_categories:
- 100K<n<1M
---

# Chinese Translation Data (WebTrans)

This dataset contains training data for Chinese text translation tasks, including classical-to-modern Chinese, modern-to-classical Chinese, punctuation addition, and poetry translation. The data is in binarized fairseq format ready for training sequence-to-sequence models.

## Dataset Structure

```
├── bpe_file/                  # BPE model files
│   ├── addpunc.bpe            # BPE codes for punctuation task
│   ├── new.bpe                # BPE codes for modern Chinese
│   ├── old.bpe                # BPE codes for classical Chinese
│   └── shici.bpe              # BPE codes for poetry
├── dataflow_addpunc/          # Binarized data for punctuation addition
│   ├── dict.src.txt           # Source vocabulary
│   ├── dict.tgt.txt           # Target vocabulary
│   └── {train,valid,test}.src-tgt.{src,tgt}.{bin,idx}
├── dataflow_new2old/          # Binarized data: modern → classical Chinese
│   ├── dict.new.txt / dict.old.txt
│   └── {train,valid,test}.new-old.{new,old}.{bin,idx}
├── dataflow_old2new/          # Binarized data: classical → modern Chinese
│   ├── dict.new.txt / dict.old.txt
│   ├── tag
│   └── {train,valid,test}.old-new.{new,old}.{bin,idx}
├── dataflow_shici/            # Binarized data: poetry translation
│   ├── dict.key.txt / dict.sen.txt
│   └── {train,valid,test}.key-sen.{key,sen}.{bin,idx}
└── dictionary/                # Chinese dictionaries
    ├── dict.txt
    ├── 中国历史地名词典.txt     # Chinese historical place names
    ├── 古代人名(25w).txt        # Ancient person names (250K)
    └── 成语(5W).txt             # Chinese idioms (50K)
```

## Tasks

| Directory | Task | Source → Target |
|---|---|---|
| `dataflow_new2old` | Modern to Classical Chinese | new → old |
| `dataflow_old2new` | Classical to Modern Chinese | old → new |
| `dataflow_addpunc` | Punctuation Addition | src → tgt |
| `dataflow_shici` | Poetry Translation | key → sen |

## Usage

This data is in fairseq binarized format. To use with fairseq:

```bash
fairseq-train dataflow_old2new \
    --save-dir checkpoints \
    --arch transformer_iwslt_de_en \
    --optimizer adam \
    --lr 0.0005 \
    --max-tokens 4096
```

## License

Apache-2.0
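
## Verifying the Files

Before running the `fairseq-train` command from the Usage section, it may help to confirm that a task directory holds every binarized file named in the Dataset Structure listing. The sketch below assumes only the `{train,valid,test}.<src>-<tgt>.{<src>,<tgt>}.{bin,idx}` naming pattern shown there. The helper names `expected_files` and `check_task_dir` are hypothetical and are not part of fairseq or of this dataset.

```bash
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names, not shipped with fairseq or the dataset).

# Print the 12 filenames expected for one language-pair suffix, one per line:
# 3 splits x 2 sides x 2 extensions.
expected_files() {
  local src="$1"
  local tgt="$2"
  local split side ext
  for split in train valid test; do
    for side in "$src" "$tgt"; do
      for ext in bin idx; do
        printf '%s.%s-%s.%s.%s\n' "$split" "$src" "$tgt" "$side" "$ext"
      done
    done
  done
}

# Report each expected file that is absent from a directory.
# Returns 0 when all files are present, 1 otherwise.
check_task_dir() {
  local dir="$1"
  local src="$2"
  local tgt="$3"
  local missing=0
  local f
  # Filenames contain no whitespace, so word splitting is safe here.
  for f in $(expected_files "$src" "$tgt"); do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      missing=1
    fi
  done
  return "$missing"
}

# Example (run from the dataset root):
#   check_task_dir dataflow_old2new old new
#   check_task_dir dataflow_shici key sen
```

The suffix pairs to pass are the ones given in the listing: `src tgt` for `dataflow_addpunc`, `new old` for `dataflow_new2old`, `old new` for `dataflow_old2new`, and `key sen` for `dataflow_shici`. The check covers only the `.bin` and `.idx` files, not the `dict.*.txt` vocabularies or the `tag` file.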
Provider:
bangboom