
sensix-zo/SFT-Paite_Translation_messaging-format

Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/sensix-zo/SFT-Paite_Translation_messaging-format
Resource description:
---
license: mit
language:
- pck
pretty_name: Paite Vocabulary — SFT Translate-Only (Chat / Messages)
task_categories:
- text-generation
tags:
- paite
- instruction-finetuning
- sft
- chat
- messages
- gemma
- unsloth
- translation
size_categories:
- 10K<n<100K
---

# Paite Vocabulary: SFT Messages (`vocab_paite_2025-12-13_translate_only_SFT_messages.jsonl`)

This file is the **chat / messaging** variant of the translate-only Paite vocabulary data. Each line is one JSON object with a **`messages`** array (user → model turn), aligned with **Gemma-style** and other trainers that expect `role` + `content` instead of separate `instruction` / `input` / `output` fields.

It is produced from **`vocab_paite_2025-12-13_translate_only.jsonl`** by `process_vocab_paite.py` (same row order and count as the source).

## Dataset composition

* **Task:** English-to-Paite translation. The user message carries the task wording plus the English source; the model message is the Paite reference translation.
* **Coverage:** Same 26,502 examples as `translate_only` (kitchen, travel, emotion, daily life, and related domains).
* **Roles:** Exactly two turns per line: **`user`** then **`model`** (assistant target for SFT).

## File description

### `vocab_paite_2025-12-13_translate_only_SFT_messages.jsonl`

| Property | Value |
|----------|--------|
| **Lines** | 26,502 |
| **Format** | JSONL (one JSON object per line, UTF-8) |
| **Schema** | Top-level key `messages`: array of `{ "role": "user" \| "model", "content": string }` |

**Example line:**

```json
{"messages": [{"role": "user", "content": "Translate : The knife is very sharp The knife is very sharp."}, {"role": "model", "content": "tem a hiam mahmah."}]}
```

* **`messages[0]` (`user`):** Prompt string built from the original `instruction` and `input` (see `process_vocab_paite.py` for the exact transformation).
* **`messages[1]` (`model`):** Paite translation, the supervised target for the assistant turn.
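
The schema described above can be checked mechanically before training. The following is a minimal sketch that assumes only what the table states: a top-level `messages` list holding a `user` turn followed by a `model` turn, each with string `content`. The helper name `validate_line` is hypothetical and is not part of `process_vocab_paite.py`.

```python
import json


def validate_line(line: str) -> dict:
    """Parse one JSONL line and check it against the documented schema.

    Hypothetical helper: raises ValueError if the line does not hold exactly
    two turns, `user` then `model`, each with string content.
    """
    obj = json.loads(line)  # json.JSONDecodeError is a subclass of ValueError
    messages = obj.get("messages") if isinstance(obj, dict) else None
    if not isinstance(messages, list) or len(messages) != 2:
        raise ValueError("expected a 'messages' list with exactly two turns")
    for turn, expected_role in zip(messages, ("user", "model")):
        if not isinstance(turn, dict) or turn.get("role") != expected_role:
            raise ValueError(f"expected a turn with role {expected_role!r}")
        if not isinstance(turn.get("content"), str):
            raise ValueError("'content' must be a string")
    return obj


# The example line from the README, split across two string literals.
example = (
    '{"messages": [{"role": "user", "content": "Translate : The knife is very sharp '
    'The knife is very sharp."}, {"role": "model", "content": "tem a hiam mahmah."}]}'
)
record = validate_line(example)
print(record["messages"][1]["content"])
```

Running a check like this over every line is one way to confirm that a regenerated file still has the two-turn structure the trainer expects.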

## Relationship to other files

| File | Role |
|------|------|
| `vocab_paite_2025-12-13_translate_only.jsonl` | Alpaca-style **instruction / input / output** (source for this file). |
| `vocab_paite_2025-12-13_translate_only_SFT_messages.jsonl` | **Chat messages** (this README). |
| `README_vocab_paite_2025-12-13_translate_only.md` | Documents the Alpaca-format file. |

Regenerate (after editing the script or source):

```bash
python3 process_vocab_paite.py vocab_paite_2025-12-13_translate_only.jsonl
```

Outputs: `*_CPT_paragraphs.jsonl` and `*_SFT_messages.jsonl`, written next to the input file.

## Technical training parameters (SFT)

* **SFT learning rate:** `2e-5` (tune if unstable).
* **LoRA rank (r):** `64` or `128`
* **LoRA alpha (α):** `128` or `256`
* **Context length:** `4096` is typical for Gemma-family fine-tunes; examples are short.
* **Packing:** Enable where your stack supports it (e.g. Unsloth) for throughput.

## Usage notes

* **Format:** JSONL. Parse each line with `json.loads`, then pass `obj["messages"]` to your chat template or trainer.
* **Gemma / HF:** Map `messages` to the model's expected chat format (many trainers accept OpenAI-style `role` + `content` lists).
* **License:** MIT (frontmatter); comply with your **base model** license (e.g. Gemma) for redistribution.

## Citation

Reference this artifact by filename and date: `vocab_paite_2025-12-13_translate_only_SFT_messages`.
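
The loading step described under Usage notes could look roughly like the sketch below. It reads JSONL lines, extracts `messages`, and optionally relabels the `model` role. The function name `iter_messages` and the `assistant_role` parameter are hypothetical. Whether a trainer wants `model` or `assistant` as the role name is an assumption to verify against its chat template.

```python
import json
from typing import Iterable, Iterator, List, Optional


def iter_messages(
    lines: Iterable[str], assistant_role: Optional[str] = None
) -> Iterator[List[dict]]:
    """Yield the `messages` list from each non-empty JSONL line.

    `lines` can be an open file handle or any iterable of strings.
    If `assistant_role` is given (e.g. "assistant"), turns whose role is
    "model" are relabelled; this is an assumption about the target trainer.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines such as a trailing newline
        messages = json.loads(line)["messages"]
        if assistant_role is not None:
            messages = [
                {**turn, "role": assistant_role} if turn["role"] == "model" else turn
                for turn in messages
            ]
        yield messages


# In-memory stand-in for the file; with the real data one would instead use
# open("vocab_paite_2025-12-13_translate_only_SFT_messages.jsonl", encoding="utf-8").
sample = [
    '{"messages": [{"role": "user", "content": "Translate : The knife is very sharp '
    'The knife is very sharp."}, {"role": "model", "content": "tem a hiam mahmah."}]}\n',
    "\n",
]
for messages in iter_messages(sample, assistant_role="assistant"):
    print([turn["role"] for turn in messages])
```

Leaving `assistant_role` as `None` keeps the roles exactly as stored in the file, which may be what a Gemma-specific template expects.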
Provider:
sensix-zo