tranguyenxuwu/javicorpus-chatml-translation

Name: tranguyenxuwu/javicorpus-chatml-translation
Creator: tranguyenxuwu
Published: 2026-04-02 17:11:31
License: No description available

Hugging Face · Updated 2026-04-02 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/tranguyenxuwu/javicorpus-chatml-translation

Description:

---
language:
- ja
- vi
license: apache-2.0
task_categories:
- translation
tags:
- machine-translation
- japanese
- vietnamese
- chatml
- sft
- parallel-corpus
- qwen3
- lora
size_categories:
- 100K<n<1M
source_datasets:
- ngovinhtn/JaViCorpus
dataset_info:
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  splits:
  - name: train
    num_examples: 456000
  - name: validation
    num_examples: 24000
  - name: test
    num_examples: 986
---

# JaViCorpus ChatML Translation Dataset (Full)

Bidirectional Japanese↔Vietnamese translation dataset in **ChatML format**, built from high-quality parallel corpora for SFT fine-tuning of large language models.

## Dataset Description

This dataset contains ~480K ChatML-formatted translation examples derived from the [ngovinhtn/JaViCorpus](https://github.com/ngovinhtn/JaViCorpus) parallel corpus collection. Each example is a 3-turn conversation (system → user → assistant) in which the user provides a sentence in the source language and the assistant provides the translation.

### Source Corpora

| Corpus | Sentence Pairs | Domain |
|---|---|---|
| TEDjavi_106K | 106,000 | TED Talk subtitles |
| wiki_20K | 20,000 | Wikipedia / ALT Treebank |
| Tatoeba_2K | 2,000 | Simple everyday sentences |
| Glosbe282K | ~210,000 (filtered) | Mixed domains |

### Key Features

- **Bidirectional**: Each sentence pair generates both JA→VI and VI→JA examples (2× amplification)
- **8 system prompt variants** per direction to prevent prompt overfitting
- **Vietnamese Unicode normalization** applied via the NFC standard
- **Quality filtered**: Removes empty pairs, punctuation-only pairs, and gross length misalignments (>10× ratio)

## Data Format

Each example follows the ChatML conversation format:

```json
{
  "messages": [
    {"role": "system", "content": "Translate the given Japanese text to Vietnamese. Ensure the translation is fluent and preserves the original meaning."},
    {"role": "user", "content": "今日はとても良い天気ですね。"},
    {"role": "assistant", "content": "hôm nay thời tiết rất đẹp nhỉ."}
  ]
}
```

## Splits

| Split | Examples | Description |
|---|---|---|
| `train` | ~456,000 | 95% of main corpora, shuffled |
| `validation` | ~24,000 | 5% of main corpora, shuffled |
| `test` | 986 | TED dev2010 + tst2010 (held-out) |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("tranguyenxuwu/javicorpus-chatml-translation")
print(ds["train"][0])
```

### For SFT Training (e.g., with TRL/Unsloth)

```python
from trl import SFTTrainer

# `model` is assumed to be loaded beforehand (e.g., with transformers or Unsloth);
# `ds` is the DatasetDict returned by load_dataset in the snippet above.
trainer = SFTTrainer(
    model=model,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    max_seq_length=4096,  # some TRL releases take this via SFTConfig instead
)
trainer.train()
```

## Intended Use

- Fine-tuning LLMs for Japanese↔Vietnamese machine translation
- Benchmarking translation quality on the JA-VI language pair
- Research on low-resource Asian language pair translation

## Limitations

- **Single sentence pairs**: Each example is one sentence, so models fine-tuned on this data will tend to produce single-sentence outputs
- **Lowercase Vietnamese**: TED subtitle data is lowercase, which propagates to model outputs
- **Domain bias**: Predominantly subtitles and encyclopedic text, so models may not generalize to legal, medical, or highly technical domains

## Citation

If you use this dataset, please cite the original corpus:

```bibtex
@inproceedings{Ngo2018Combining,
  author    = {Thi{-}Vinh Ngo and Thanh{-}Le Ha and Phuong{-}Thai Nguyen and Le{-}Minh Nguyen},
  title     = {Combining Advanced Methods in Japanese-Vietnamese Neural Machine Translation},
  booktitle = {Proceedings of the 10th International Conference on Knowledge and Systems Engineering (KSE 2018)},
  year      = {2018},
  address   = {Hochiminh City, Vietnam},
  url       = {https://arxiv.org/pdf/1805.07133.pdf}
}
```

## License

The code is licensed under Apache 2.0. The parallel corpora follow the licensing policies of their original sources (TED, Glosbe, OPUS, ALT) and are **restricted to research purposes only, with no commercial usage permitted**.
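The Key Features listed in the card (bidirectional expansion, NFC normalization, and the >10× length-ratio filter) can be illustrated with a minimal sketch. This is not the author's pipeline, which the card does not show. The helper names `keep_pair` and `to_chatml`, the single prompt per direction, and the choice to measure length in characters are all assumptions made for illustration.

```python
import random
import unicodedata

# Illustrative prompts only: the card mentions 8 variants per direction,
# but their exact wording is not listed, so one stand-in is used for each.
PROMPTS = {
    ("ja", "vi"): ["Translate the given Japanese text to Vietnamese."],
    ("vi", "ja"): ["Translate the given Vietnamese text to Japanese."],
}


def is_punct_only(text):
    """True if every non-space character is Unicode punctuation or a symbol."""
    chars = [c for c in text if not c.isspace()]
    return all(unicodedata.category(c)[0] in "PS" for c in chars)


def keep_pair(ja, vi, max_ratio=10.0):
    """Hypothetical quality filter: drop empty, punctuation-only, or badly
    length-mismatched pairs (length counted in characters, an assumption)."""
    ja, vi = ja.strip(), vi.strip()
    if not ja or not vi:
        return False
    if is_punct_only(ja) or is_punct_only(vi):
        return False
    longer, shorter = max(len(ja), len(vi)), min(len(ja), len(vi))
    return longer / shorter <= max_ratio


def to_chatml(ja, vi, rng):
    """Turn one sentence pair into two ChatML examples (JA->VI and VI->JA)."""
    vi = unicodedata.normalize("NFC", vi)  # compose Vietnamese diacritics
    directions = {("ja", "vi"): (ja, vi), ("vi", "ja"): (vi, ja)}
    examples = []
    for direction, (source, target) in directions.items():
        examples.append({
            "messages": [
                {"role": "system", "content": rng.choice(PROMPTS[direction])},
                {"role": "user", "content": source},
                {"role": "assistant", "content": target},
            ]
        })
    return examples


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so prompt selection is reproducible
    pair = ("今日はとても良い天気ですね。", "hôm nay thời tiết rất đẹp nhỉ.")
    if keep_pair(*pair):
        for example in to_chatml(*pair, rng):
            print(example["messages"][1]["content"])
```

Normalizing before building both directions keeps the Vietnamese text identical whether it appears as the user turn or the assistant turn.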

Provided by:

tranguyenxuwu
