tranguyenxuwu/javicorpus-chatml-translation-mini
Source: Hugging Face · Updated: 2026-04-02 · Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/tranguyenxuwu/javicorpus-chatml-translation-mini
Description:
---
language:
- ja
- vi
license: apache-2.0
task_categories:
- translation
tags:
- machine-translation
- japanese
- vietnamese
- chatml
- sft
- parallel-corpus
- qwen3
- lora
size_categories:
- 10K<n<100K
source_datasets:
- ngovinhtn/JaViCorpus
dataset_info:
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  splits:
  - name: train
    num_examples: 90000
  - name: validation
    num_examples: 4736
  - name: test
    num_examples: 493
---
# JaViCorpus ChatML Translation Dataset (Mini)
A **90K-example subset** of the full JaViCorpus ChatML translation dataset, optimized for fast SFT training iterations on Google Colab (T4 GPU).
For the full ~480K dataset, see [`tranguyenxuwu/javicorpus-chatml-translation`](https://huggingface.co/datasets/tranguyenxuwu/javicorpus-chatml-translation).
## Dataset Description
This is a randomly sampled subset of the bidirectional Japanese↔Vietnamese translation dataset derived from the [ngovinhtn/JaViCorpus](https://github.com/ngovinhtn/JaViCorpus) parallel corpus collection. Each example is a three-turn ChatML conversation (system → user → assistant).
### Source Corpora
| Corpus | Sentence Pairs | Domain |
|---|---|---|
| TEDjavi_106K | 106,000 | TED Talk subtitles |
| wiki_20K | 20,000 | Wikipedia / ALT Treebank |
| Tatoeba_2K | 2,000 | Simple everyday sentences |
| Glosbe282K | ~210,000 (filtered) | Mixed domains |
### Key Features
- **Bidirectional**: Both JA→VI and VI→JA examples
- **8 system prompt variants** per direction to prevent prompt overfitting
- **Vietnamese Unicode normalization** applied
- **Quality filtered**: Removes empty, punctuation-only, and misaligned pairs
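The normalization and filtering steps listed above can be illustrated with a minimal sketch. The card does not say which Unicode normalization form or which alignment heuristic was used, so the NFC form, the `keep_pair` helper, and the `max_len_ratio` threshold below are all assumptions chosen for illustration, not the dataset's actual pipeline.

```python
import unicodedata


def normalize_vi(text: str) -> str:
    """Compose Vietnamese diacritics into precomposed code points (NFC, assumed) and trim whitespace."""
    return unicodedata.normalize("NFC", text).strip()


def is_punctuation_only(text: str) -> bool:
    """Return True when every non-space character is Unicode punctuation or a symbol."""
    chars = [c for c in text if not c.isspace()]
    return bool(chars) and all(unicodedata.category(c)[0] in "PS" for c in chars)


def keep_pair(ja: str, vi: str, max_len_ratio: float = 4.0) -> bool:
    """Hypothetical filter: drop empty, punctuation-only, and length-mismatched pairs."""
    ja, vi = ja.strip(), normalize_vi(vi)
    if not ja or not vi:
        return False
    if is_punctuation_only(ja) or is_punctuation_only(vi):
        return False
    # Crude misalignment check: character lengths should be within a fixed ratio.
    ratio = max(len(ja), len(vi)) / min(len(ja), len(vi))
    return ratio <= max_len_ratio


decomposed = "e\u0302\u0301"  # "e" + combining circumflex + combining acute
composed = normalize_vi(decomposed)  # single code point U+1EBF
```

A length ratio is only one possible proxy for misalignment; a real pipeline might rely on token counts or alignment scores instead.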
## Data Format
```json
{
  "messages": [
    {"role": "system", "content": "Translate the given Japanese text to Vietnamese. Ensure the translation is fluent and preserves the original meaning."},
    {"role": "user", "content": "今日はとても良い天気ですね。"},
    {"role": "assistant", "content": "hôm nay thời tiết rất đẹp nhỉ."}
  ]
}
```
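To show how a record in this format maps onto ChatML text, here is a small sketch that renders the `messages` list with the `<|im_start|>` and `<|im_end|>` delimiters. The `to_chatml` helper is hypothetical; in practice a tokenizer's own chat template would normally do this step and may differ in details.

```python
def to_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render a list of {"role", "content"} dicts as ChatML-delimited text."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with the translation.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = {
    "messages": [
        {"role": "system", "content": "Translate the given Japanese text to Vietnamese. Ensure the translation is fluent and preserves the original meaning."},
        {"role": "user", "content": "今日はとても良い天気ですね。"},
        {"role": "assistant", "content": "hôm nay thời tiết rất đẹp nhỉ."},
    ]
}

chatml_text = to_chatml(example["messages"])
print(chatml_text)
```

For inference, passing only the system and user turns with `add_generation_prompt=True` leaves the assistant turn open for the model to complete.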
## Splits
| Split | Examples | Description |
|---|---|---|
| `train` | 90,000 | Randomly sampled from full train split |
| `validation` | 4,736 | Full validation split (unchanged) |
| `test` | 493 | Full test split — TED dev2010 + tst2010 (held-out) |
## Usage
```python
from datasets import load_dataset
ds = load_dataset("tranguyenxuwu/javicorpus-chatml-translation-mini")
print(ds["train"][0])
```
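Once a record is loaded, it can be useful to check that it follows the system → user → assistant layout before training. The sketch below uses a literal record in place of `ds["train"][0]` so it runs offline; `split_turns` and `EXPECTED_ROLES` are hypothetical names introduced here for illustration.

```python
EXPECTED_ROLES = ("system", "user", "assistant")


def split_turns(example: dict) -> tuple[str, str, str]:
    """Return (system_prompt, source_text, target_text), or raise if the role layout differs."""
    messages = example["messages"]
    roles = tuple(m["role"] for m in messages)
    if roles != EXPECTED_ROLES:
        raise ValueError(f"unexpected role sequence: {roles}")
    system, source, target = (m["content"] for m in messages)
    return system, source, target


# Stand-in for ds["train"][0]; the content is copied from the Data Format example.
record = {
    "messages": [
        {"role": "system", "content": "Translate the given Japanese text to Vietnamese. Ensure the translation is fluent and preserves the original meaning."},
        {"role": "user", "content": "今日はとても良い天気ですね。"},
        {"role": "assistant", "content": "hôm nay thời tiết rất đẹp nhỉ."},
    ]
}

system_prompt, source_text, target_text = split_turns(record)
```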
### For SFT Training (e.g., with Unsloth on Colab)
```python
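from trl import SFTTrainer

# Assumes `ds` was loaded as shown under Usage and `model` is already
# initialized (e.g., a LoRA-wrapped model prepared with Unsloth).
trainer = SFTTrainer(
    model=model,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    # Depending on the TRL version, the sequence length may need to be set
    # on SFTConfig instead of being passed here.
    max_seq_length=4096,
)
trainer.train()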
```
### Training Time Estimates (Google Colab T4)
| Steps | % of Data | Est. Time |
|---|---|---|
| 2,000 | 18% | ~3.5 hours |
| 5,000 | 44% | ~8.5 hours |
| 11,250 | 100% (1 epoch) | ~19 hours |
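The figures in this table are mutually consistent, as the arithmetic sketch below shows. The effective batch size of 8 is inferred from 90,000 examples over 11,250 steps; the card does not state how it splits between per-device batch size and gradient accumulation.

```python
TRAIN_EXAMPLES = 90_000
STEPS_PER_EPOCH = 11_250

# Examples consumed per optimizer step (inferred, not stated in the card).
effective_batch = TRAIN_EXAMPLES / STEPS_PER_EPOCH


def fraction_seen(steps: int) -> float:
    """Fraction of the train split consumed after `steps` optimizer steps."""
    return steps * effective_batch / TRAIN_EXAMPLES


def seconds_per_step(steps: int, hours: float) -> float:
    """Average wall-clock seconds per step implied by a (steps, hours) estimate."""
    return hours * 3600 / steps


for steps, hours in [(2_000, 3.5), (5_000, 8.5), (11_250, 19.0)]:
    print(f"{steps:>6} steps: {fraction_seen(steps):.0%} of data, "
          f"{seconds_per_step(steps, hours):.2f} s/step")
```

All three rows imply roughly 6 seconds per step, so estimates for other step counts can be scaled linearly from that rate.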
## Evaluation Results (2000 steps on this dataset)
| Metric | Base Qwen3-8B | SFT-LoRA |
|---|---|---|
| BLEU (JA→VI) | 5.53 | **48.39** |
| chrF (Overall) | 20.43 | **59.81** |
| Success rate | 64% | **95%** |
| Avg latency | 8.78s | **2.99s** |
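For readers unfamiliar with chrF, the sketch below shows the idea behind the metric: an F-score over character n-gram overlap, weighted toward recall. It is a simplified sentence-level version written for illustration, assuming n-grams up to length 6 and β = 2. It is not the evaluation script behind the table above, and its scores will not exactly match standard tools.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale: F-beta over averaged n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


reference = "hôm nay thời tiết rất đẹp nhỉ."
exact = simple_chrf(reference, reference)
partial = simple_chrf("hôm nay trời đẹp.", reference)
```

An identical hypothesis scores the maximum, while a partially matching one lands somewhere in between, which is why chrF is often reported alongside BLEU for morphologically or orthographically rich output.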
## Intended Use
- Fast SFT fine-tuning iterations for Japanese↔Vietnamese translation on Google Colab
- Prototyping and hyperparameter tuning before training on the full dataset
- Benchmarking translation quality improvements
## Limitations
- **Single sentence pairs**: Each example is a single sentence, so models fine-tuned on this data tend to produce single-sentence outputs
- **Lowercase Vietnamese**: TED subtitle data is lowercase
- **Subset bias**: Random sampling may slightly alter corpus distribution compared to the full dataset
## Citation
```bibtex
@inproceedings{Ngo2018Combining,
author = {Thi{-}Vinh Ngo and Thanh{-}Le Ha and Phuong{-}Thai Nguyen and Le{-}Minh Nguyen},
title = {Combining Advanced Methods in Japanese-Vietnamese Neural Machine Translation},
booktitle = {Proceedings of the 10th International Conference on Knowledge and Systems Engineering (KSE 2018)},
year = {2018},
address = {Hochiminh City, Vietnam},
url = {https://arxiv.org/pdf/1805.07133.pdf}
}
```
## License
The code is licensed under Apache 2.0. The parallel corpora follow the licensing policies of their original sources (TED, Glosbe, OPUS, ALT) and are **restricted to research purposes only — no commercial usage permitted**.
Provider:
tranguyenxuwu



