tranguyenxuwu/javicorpus-chatml-translation
Source: Hugging Face · Updated 2026-04-02 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/tranguyenxuwu/javicorpus-chatml-translation
Resource description:
---
language:
- ja
- vi
license: apache-2.0
task_categories:
- translation
tags:
- machine-translation
- japanese
- vietnamese
- chatml
- sft
- parallel-corpus
- qwen3
- lora
size_categories:
- 100K<n<1M
source_datasets:
- ngovinhtn/JaViCorpus
dataset_info:
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  splits:
  - name: train
    num_examples: 456000
  - name: validation
    num_examples: 24000
  - name: test
    num_examples: 986
---
# JaViCorpus ChatML Translation Dataset (Full)
Bidirectional Japanese↔Vietnamese translation dataset in **ChatML format**, built from high-quality parallel corpora for supervised fine-tuning (SFT) of large language models.
## Dataset Description
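This dataset contains ~480K ChatML-formatted translation examples for training and validation (456,000 train + 24,000 validation), plus a 986-example held-out test set. All examples are derived from the [ngovinhtn/JaViCorpus](https://github.com/ngovinhtn/JaViCorpus) parallel corpus collection.
Each example is a three-message conversation (system → user → assistant): the system message states the translation direction, the user message holds a sentence in the source language, and the assistant message holds its translation.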
### Source Corpora
| Corpus | Sentence Pairs | Domain |
|---|---|---|
| TEDjavi_106K | 106,000 | TED Talk subtitles |
| wiki_20K | 20,000 | Wikipedia / ALT Treebank |
| Tatoeba_2K | 2,000 | Simple everyday sentences |
| Glosbe282K | ~210,000 (filtered) | Mixed domains |
### Key Features
- **Bidirectional**: Each sentence pair generates both JA→VI and VI→JA examples (2× amplification)
- **8 system prompt variants** per direction to prevent prompt overfitting
- **Vietnamese Unicode normalization**: Vietnamese text is normalized to Unicode NFC
- **Quality filtered**: Removes empty pairs, punctuation-only pairs, and pairs with a gross length mismatch (>10× length ratio)
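
The features above can be sketched as a small preprocessing routine. This is an illustrative reconstruction, not the code used to build the dataset: the function names, the two prompt strings per direction (the dataset uses 8), the character-based length ratio, and the definition of "punctuation-only" are all assumptions.

```python
import random
import unicodedata

# Hypothetical prompt pools; the real dataset uses 8 variants per direction.
PROMPTS = {
    "ja-vi": [
        "Translate the given Japanese text to Vietnamese.",
        "You are a translator. Render the Japanese input in Vietnamese.",
    ],
    "vi-ja": [
        "Translate the given Vietnamese text to Japanese.",
        "You are a translator. Render the Vietnamese input in Japanese.",
    ],
}


def keep_pair(ja: str, vi: str, max_ratio: float = 10.0) -> bool:
    """Return True if the pair passes the assumed quality filters."""
    ja, vi = ja.strip(), vi.strip()
    if not ja or not vi:
        return False  # one side is empty
    # Treat a side with no letter or digit as punctuation-only.
    if not any(c.isalnum() for c in ja) or not any(c.isalnum() for c in vi):
        return False
    # Gross length mismatch, measured here in characters (an assumption).
    longer, shorter = max(len(ja), len(vi)), min(len(ja), len(vi))
    return longer / shorter <= max_ratio


def make_examples(ja: str, vi: str, rng: random.Random) -> list[dict]:
    """Build one JA→VI and one VI→JA example from a single sentence pair."""
    vi = unicodedata.normalize("NFC", vi)  # compose Vietnamese diacritics
    if not keep_pair(ja, vi):
        return []
    examples = []
    for direction, src, tgt in (("ja-vi", ja, vi), ("vi-ja", vi, ja)):
        examples.append({
            "messages": [
                {"role": "system", "content": rng.choice(PROMPTS[direction])},
                {"role": "user", "content": src.strip()},
                {"role": "assistant", "content": tgt.strip()},
            ]
        })
    return examples
```

Sampling the system prompt per example, rather than fixing one per direction, is what keeps a fine-tuned model from tying translation behavior to a single exact instruction string.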
## Data Format
Each example follows the ChatML conversation format:
```json
{
"messages": [
{"role": "system", "content": "Translate the given Japanese text to Vietnamese. Ensure the translation is fluent and preserves the original meaning."},
{"role": "user", "content": "今日はとても良い天気ですね。"},
{"role": "assistant", "content": "hôm nay thời tiết rất đẹp nhỉ."}
]
}
```
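
During training, a `messages` list like this is usually serialized into one string by the model tokenizer's chat template. The sketch below shows the generic ChatML layout, assuming the standard `<|im_start|>` and `<|im_end|>` delimiters; the helper name `to_chatml` is hypothetical, and a specific model's template may add further tokens.

```python
def to_chatml(messages: list[dict]) -> str:
    """Render {role, content} messages in the plain ChatML text layout."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


example = {
    "messages": [
        {"role": "system", "content": "Translate the given Japanese text to Vietnamese."},
        {"role": "user", "content": "今日はとても良い天気ですね。"},
        {"role": "assistant", "content": "hôm nay thời tiết rất đẹp nhỉ."},
    ]
}

text = to_chatml(example["messages"])
print(text)
```

In practice the tokenizer's own `apply_chat_template` method is the safer choice, since it applies whatever template the model was trained with.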
## Splits
| Split | Examples | Description |
|---|---|---|
| `train` | ~456,000 | 95% of main corpora, shuffled |
| `validation` | ~24,000 | 5% of main corpora, shuffled |
| `test` | 986 | TED dev2010 + tst2010 (held-out) |
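
The train and validation sizes are consistent with a 95/5 split of 480,000 examples. A minimal sketch of how such a shuffled split could be produced in plain Python; the function name and the seed are assumptions, not the values used for this dataset.

```python
import random


def split_train_val(examples, train_pct: int = 95, seed: int = 42):
    """Shuffle the examples and split them into train and validation parts."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(items) * train_pct // 100
    return items[:cut], items[cut:]


train, val = split_train_val(range(480_000))
print(len(train), len(val))  # 456000 24000
```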
## Usage
```python
from datasets import load_dataset
ds = load_dataset("tranguyenxuwu/javicorpus-chatml-translation")
print(ds["train"][0])
```
### For SFT Training (e.g., with TRL/Unsloth)
```python
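from datasets import load_dataset
from trl import SFTTrainer

ds = load_dataset("tranguyenxuwu/javicorpus-chatml-translation")

# `model` is a placeholder: pass a model you have already loaded
# (for example via Unsloth or transformers).
trainer = SFTTrainer(
    model=model,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    # Accepted directly by Unsloth and older TRL releases; recent TRL
    # versions expect the length limit on SFTConfig instead.
    max_seq_length=4096,
)
trainer.train()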
```
## Intended Use
- Fine-tuning LLMs for Japanese↔Vietnamese machine translation
- Benchmarking translation quality on the JA-VI language pair
- Research on low-resource Asian language pair translation
## Limitations
- **Single sentence pairs**: Each example is one sentence, so models fine-tuned on this data will tend to produce single-sentence outputs
- **Lowercase Vietnamese**: The TED subtitle data is lowercased, so fine-tuned models tend to produce lowercase Vietnamese output
- **Domain bias**: Predominantly subtitles and encyclopedic text, so models may not generalize to legal, medical, or highly technical domains
## Citation
If you use this dataset, please cite the original corpus:
```bibtex
@inproceedings{Ngo2018Combining,
author = {Thi{-}Vinh Ngo and Thanh{-}Le Ha and Phuong{-}Thai Nguyen and Le{-}Minh Nguyen},
title = {Combining Advanced Methods in Japanese-Vietnamese Neural Machine Translation},
booktitle = {Proceedings of the 10th International Conference on Knowledge and Systems Engineering (KSE 2018)},
year = {2018},
address = {Hochiminh City, Vietnam},
url = {https://arxiv.org/pdf/1805.07133.pdf}
}
```
## License
The code is licensed under Apache 2.0. The parallel corpora follow the licensing policies of their original sources (TED, Glosbe, OPUS, ALT) and are **restricted to research purposes only; commercial use is not permitted**.
Provider:
tranguyenxuwu



