sensix-zo/SFT-Paite_Translation_messaging-format
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/sensix-zo/SFT-Paite_Translation_messaging-format
Resource overview:
---
license: mit
language:
- pck
pretty_name: Paite Vocabulary — SFT Translate-Only (Chat / Messages)
task_categories:
- text-generation
tags:
- paite
- instruction-finetuning
- sft
- chat
- messages
- gemma
- unsloth
- translation
size_categories:
- 10K<n<100K
---
# Paite Vocabulary — SFT Messages (`vocab_paite_2025-12-13_translate_only_SFT_messages.jsonl`)
This file is the **chat / messaging** variant of the translate-only Paite vocabulary data. Each line is one JSON object with a **`messages`** array holding a single user → model exchange. The layout suits **Gemma-style** chat templates and other trainers that expect `role` + `content` entries instead of separate `instruction` / `input` / `output` fields.
It is produced from **`vocab_paite_2025-12-13_translate_only.jsonl`** by `process_vocab_paite.py` (same row order and count as the source).
## Dataset composition
* **Task:** English-to-Paite translation; the user message carries the task wording plus the English source; the model message is the Paite reference translation.
* **Coverage:** Same 26,502 examples as `translate_only` (kitchen, travel, emotion, daily life, and related domains).
* **Roles:** Exactly two turns per line — **`user`** then **`model`** (assistant target for SFT).
## File description
### `vocab_paite_2025-12-13_translate_only_SFT_messages.jsonl`
| Property | Value |
|----------|--------|
| **Lines** | 26,502 |
| **Format** | JSONL (one JSON object per line, UTF-8) |
| **Schema** | Top-level key `messages`: array of `{ "role": "user" \| "model", "content": string }` |
**Example line:**
```json
{"messages": [{"role": "user", "content": "Translate : The knife is very sharp The knife is very sharp."}, {"role": "model", "content": "tem a hiam mahmah."}]}
```
* **`messages[0]` (`user`):** Prompt string built from the original `instruction` and `input` (see `process_vocab_paite.py` for the exact transformation).
* **`messages[1]` (`model`):** Paite translation — the supervised target for the assistant turn.
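The schema above can be checked mechanically before training. The following is a minimal sketch, not part of the dataset tooling: the helper name `validate_line` is hypothetical, and the non-empty-content check is an extra assumption beyond the documented schema, which only requires a string.

```python
import json

# Role order documented for every line of the file.
EXPECTED_ROLES = ["user", "model"]


def validate_line(line: str) -> dict:
    """Parse one JSONL line and check it against the documented schema."""
    obj = json.loads(line)
    messages = obj["messages"]
    roles = [m["role"] for m in messages]
    if roles != EXPECTED_ROLES:
        raise ValueError(f"expected roles {EXPECTED_ROLES}, got {roles}")
    for m in messages:
        # Assumption: empty or whitespace-only content is treated as an error.
        if not isinstance(m["content"], str) or not m["content"].strip():
            raise ValueError("content must be a non-empty string")
    return obj


# The example line from this README.
example = (
    '{"messages": [{"role": "user", "content": "Translate : The knife is very '
    'sharp The knife is very sharp."}, {"role": "model", "content": '
    '"tem a hiam mahmah."}]}'
)
record = validate_line(example)
target = record["messages"][1]["content"]  # the supervised target string
```

Running a check like this over all lines is a cheap way to catch truncated or hand-edited rows before they reach the trainer.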
## Relationship to other files
| File | Role |
|------|------|
| `vocab_paite_2025-12-13_translate_only.jsonl` | Alpaca-style **instruction / input / output** (source for this file). |
| `vocab_paite_2025-12-13_translate_only_SFT_messages.jsonl` | **Chat messages** (this README). |
| `README_vocab_paite_2025-12-13_translate_only.md` | Documents the Alpaca-format file. |
Regenerate (after editing the script or source):
```bash
python3 process_vocab_paite.py vocab_paite_2025-12-13_translate_only.jsonl
```
Outputs: `*_CPT_paragraphs.jsonl` and `*_SFT_messages.jsonl`, written next to the input file.
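For orientation, the Alpaca-to-messages step might look roughly like the sketch below. This is not the contents of `process_vocab_paite.py`: the functions `to_messages`, `sft_output_path`, and `convert` are hypothetical, the way `instruction` and `input` are joined (a single space) is an assumption, and the CPT output is not covered.

```python
import json
from pathlib import Path


def to_messages(row: dict) -> dict:
    """Turn one Alpaca-style row into a two-turn messages record."""
    # Assumption: instruction and input are joined with one space.
    # The real script may build the prompt differently.
    parts = (row.get("instruction", ""), row.get("input", ""))
    prompt = " ".join(p for p in parts if p)
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "model", "content": row["output"]},
        ]
    }


def sft_output_path(source: Path) -> Path:
    """Derive `<stem>_SFT_messages.jsonl` in the same directory as the input."""
    return source.with_name(f"{source.stem}_SFT_messages{source.suffix}")


def convert(source: Path) -> Path:
    """Write one output line per input line, preserving row order."""
    target = sft_output_path(source)
    with source.open(encoding="utf-8") as src, target.open("w", encoding="utf-8") as dst:
        for line in src:
            row = json.loads(line)
            # ensure_ascii=False keeps non-ASCII text readable in the output file.
            dst.write(json.dumps(to_messages(row), ensure_ascii=False) + "\n")
    return target


out_name = sft_output_path(Path("vocab_paite_2025-12-13_translate_only.jsonl")).name
```

Because `convert` emits exactly one line per input line, the output keeps the same row order and count as the source.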
## Technical training parameters (SFT)
* **SFT learning rate:** `2e-5` as a starting point; adjust it if training is unstable.
* **LoRA rank (r):** `64` or `128`
* **LoRA alpha (α):** `128` or `256`
* **Context length:** `4096` is typical for Gemma-family fine-tunes; examples are short.
* **Packing:** Enable where your stack supports it (e.g. Unsloth) for throughput.
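The values above can be collected in one place, as in the sketch below. It is plain Python with no trainer dependency: the class `SftConfig` and its field names are hypothetical, and pairing rank 64 with alpha 128 (and 128 with 256) is an assumption, since the list does not say which rank goes with which alpha. The `lora_scale` property uses the standard LoRA scaling factor, alpha divided by rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SftConfig:
    """Hypothetical container for the suggested SFT hyperparameters."""

    learning_rate: float = 2e-5
    lora_r: int = 64
    lora_alpha: int = 128
    max_seq_length: int = 4096
    packing: bool = True  # enable only if your training stack supports it

    @property
    def lora_scale(self) -> float:
        # Standard LoRA scaling: the low-rank update is multiplied by alpha / r.
        return self.lora_alpha / self.lora_r


# Assumed pairings: (r=64, alpha=128) and (r=128, alpha=256).
small = SftConfig()
large = SftConfig(lora_r=128, lora_alpha=256)
```

Under the assumed pairings both configurations have the same scale of 2, so moving from rank 64 to 128 adds adapter capacity without changing the scaling of the update.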
## Usage notes
* **Format:** JSONL — parse each line with `json.loads`, then pass `obj["messages"]` to your chat template or trainer.
* **Gemma / HF:** Map `messages` to the model’s expected chat format. Many trainers accept OpenAI-style `role` + `content` lists, but those usually name the second role `assistant`, so rename `model` if your stack requires it.
* **License:** MIT, as declared in the frontmatter; also comply with your **base model** license (e.g. Gemma) when redistributing.
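The loading step described in these notes can be sketched as follows. The function `iter_conversations` and its `rename_model_to` parameter are hypothetical; the optional rename is there for stacks that expect an `assistant` role rather than `model`. The sample uses an in-memory buffer in place of the real file so the snippet runs on its own.

```python
import io
import json
from typing import Iterable, Iterator, List, Optional


def iter_conversations(
    lines: Iterable[str], rename_model_to: Optional[str] = None
) -> Iterator[List[dict]]:
    """Yield the `messages` list from each JSONL line, optionally renaming the model role."""
    for line in lines:
        if not line.strip():
            continue  # tolerate a trailing blank line
        messages = json.loads(line)["messages"]
        if rename_model_to is not None:
            messages = [
                {**m, "role": rename_model_to} if m["role"] == "model" else m
                for m in messages
            ]
        yield messages


# In practice, pass an open file handle instead, e.g.
# open("vocab_paite_2025-12-13_translate_only_SFT_messages.jsonl", encoding="utf-8")
sample = io.StringIO(
    '{"messages": [{"role": "user", "content": "Translate : The knife is very '
    'sharp The knife is very sharp."}, {"role": "model", "content": '
    '"tem a hiam mahmah."}]}\n'
)
conversations = list(iter_conversations(sample, rename_model_to="assistant"))
```

Each yielded list can then be handed to a chat template or trainer; leave `rename_model_to` unset to keep the roles exactly as stored.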
## Citation
Reference this artifact by filename and date: `vocab_paite_2025-12-13_translate_only_SFT_messages`.
Provider:
sensix-zo



