
tranguyenxuwu/javicorpus-chatml-translation-mini

Source: Hugging Face. Updated 2026-04-02; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/tranguyenxuwu/javicorpus-chatml-translation-mini
Resource description:
---
language:
- ja
- vi
license: apache-2.0
task_categories:
- translation
tags:
- machine-translation
- japanese
- vietnamese
- chatml
- sft
- parallel-corpus
- qwen3
- lora
size_categories:
- 10K<n<100K
source_datasets:
- ngovinhtn/JaViCorpus
dataset_info:
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  splits:
  - name: train
    num_examples: 90000
  - name: validation
    num_examples: 4736
  - name: test
    num_examples: 493
---

# JaViCorpus ChatML Translation Dataset (Mini)

A **90K-example subset** of the full JaViCorpus ChatML translation dataset, optimized for fast SFT training iterations on Google Colab (T4 GPU).

For the full ~480K dataset, see [`tranguyenxuwu/javicorpus-chatml-translation`](https://huggingface.co/datasets/tranguyenxuwu/javicorpus-chatml-translation).

## Dataset Description

This is a strategically sampled subset of the bidirectional Japanese↔Vietnamese translation dataset derived from the [ngovinhtn/JaViCorpus](https://github.com/ngovinhtn/JaViCorpus) parallel corpus collection. Each example is a 3-turn ChatML conversation (system → user → assistant).

### Source Corpora

| Corpus | Sentence Pairs | Domain |
|---|---|---|
| TEDjavi_106K | 106,000 | TED Talk subtitles |
| wiki_20K | 20,000 | Wikipedia / ALT Treebank |
| Tatoeba_2K | 2,000 | Simple everyday sentences |
| Glosbe282K | ~210,000 (filtered) | Mixed domains |

### Key Features

- **Bidirectional**: Both JA→VI and VI→JA examples
- **8 system prompt variants** per direction to prevent prompt overfitting
- **Vietnamese Unicode normalization** applied
- **Quality filtered**: Removes empty, punctuation-only, and misaligned pairs

## Data Format

```json
{
  "messages": [
    {"role": "system", "content": "Translate the given Japanese text to Vietnamese. Ensure the translation is fluent and preserves the original meaning."},
    {"role": "user", "content": "今日はとても良い天気ですね。"},
    {"role": "assistant", "content": "hôm nay thời tiết rất đẹp nhỉ."}
  ]
}
```

## Splits

| Split | Examples | Description |
|---|---|---|
| `train` | 90,000 | Randomly sampled from full train split |
| `validation` | 4,736 | Full validation split (unchanged) |
| `test` | 493 | Full test split: TED dev2010 + tst2010 (held-out) |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("tranguyenxuwu/javicorpus-chatml-translation-mini")
print(ds["train"][0])
```

### For SFT Training (e.g., with Unsloth on Colab)

```python
from trl import SFTTrainer

# `model` must already be loaded (e.g., with Unsloth);
# `ds` is the DatasetDict from the loading snippet above.
trainer = SFTTrainer(
    model=model,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    max_seq_length=4096,
)
trainer.train()
```

### Training Time Estimates (Google Colab T4)

| Steps | % of Data | Est. Time |
|---|---|---|
| 2,000 | 18% | ~3.5 hours |
| 5,000 | 44% | ~8.5 hours |
| 11,250 | 100% (1 epoch) | ~19 hours |

## Evaluation Results (2000 steps on this dataset)

| Metric | Base Qwen3-8B | SFT-LoRA |
|---|---|---|
| BLEU (JA→VI) | 5.53 | **48.39** |
| chrF (Overall) | 20.43 | **59.81** |
| Success rate | 64% | **95%** |
| Avg latency | 8.78s | **2.99s** |

## Intended Use

- Fast SFT fine-tuning iterations for Japanese↔Vietnamese translation on Google Colab
- Prototyping and hyperparameter tuning before training on the full dataset
- Benchmarking translation quality improvements

## Limitations

- **Single sentence pairs**: Each example is one sentence, so models will produce single-sentence outputs
- **Lowercase Vietnamese**: TED subtitle data is lowercase
- **Subset bias**: Random sampling may slightly alter corpus distribution compared to the full dataset

## Citation

```bibtex
@inproceedings{Ngo2018Combining,
  author    = {Thi{-}Vinh Ngo and Thanh{-}Le Ha and Phuong{-}Thai Nguyen and Le{-}Minh Nguyen},
  title     = {Combining Advanced Methods in Japanese-Vietnamese Neural Machine Translation},
  booktitle = {Proceedings of the 10th International Conference on Knowledge and Systems Engineering (KSE 2018)},
  year      = {2018},
  address   = {Hochiminh City, Vietnam},
  url       = {https://arxiv.org/pdf/1805.07133.pdf}
}
```

## License

The code is licensed under Apache 2.0. The parallel corpora follow the licensing policies of their original sources (TED, Glosbe, OPUS, ALT) and are **restricted to research purposes only; no commercial usage is permitted**.
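
The step counts in the training time estimates can be cross-checked against the split sizes. The sketch below is a minimal illustration, not part of the original card: it assumes every optimizer step consumes the same number of examples, and it infers that number (8) by dividing the 90,000 training examples by the 11,250 steps listed for one epoch. The card itself does not state a batch size or gradient accumulation setting, and the names `EXAMPLES_PER_STEP` and `epoch_fraction` are hypothetical.

```python
# Figures taken from the dataset card.
TRAIN_EXAMPLES = 90_000    # size of the train split
STEPS_PER_EPOCH = 11_250   # step count listed as "100% (1 epoch)"

# Inferred, not stated in the card: examples consumed per optimizer step
# (per-device batch size times gradient accumulation steps).
EXAMPLES_PER_STEP = TRAIN_EXAMPLES // STEPS_PER_EPOCH


def epoch_fraction(steps: int,
                   examples_per_step: int = EXAMPLES_PER_STEP,
                   train_examples: int = TRAIN_EXAMPLES) -> float:
    """Return the fraction of one epoch covered after `steps` optimizer steps."""
    return steps * examples_per_step / train_examples


if __name__ == "__main__":
    for steps in (2_000, 5_000, 11_250):
        print(f"{steps:>6} steps -> {epoch_fraction(steps):.0%} of one epoch")
```

Under that assumption, 2,000, 5,000, and 11,250 steps come out at 18%, 44%, and 100% of one epoch, which matches the "% of Data" column. With a different effective batch size, pass it as `examples_per_step` to rescale the estimates.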
Provider:
tranguyenxuwu