sensix-zo/SFT-Paite_Translation
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/sensix-zo/SFT-Paite_Translation
Resource description:
---
license: mit
language:
- pck
pretty_name: Paite Vocabulary — SFT Translate-Only (Instruction / Input / Output)
task_categories:
- text-generation
tags:
- paite
- instruction-finetuning
- sft
- gemma
- unsloth
- alpaca
- translation
size_categories:
- 10K<n<100K
---
# Paite Vocabulary — SFT Translate-Only (`vocab_paite_2025-12-13_translate_only.jsonl`)
This dataset is a **filtered** slice of the Paite vocabulary instruction set, intended for **supervised fine-tuning (SFT)**. Only rows whose **`instruction`** starts with **`Translate`** (case-insensitive, ignoring leading whitespace) are kept, so every example is an English-to-Paite translation task. The **Instruction–Input–Output** structure matches the Alpaca-style format used with Gemma, Unsloth, and similar training stacks.
## Dataset composition
* **Task focus:** English appears in **`input`**; Paite is the target in **`output`**. **`instruction`** frames the task (e.g. `Translate … to Paite`).
* **Coverage:** Broad vocabulary and short-sentence patterns (kitchen, travel, emotion, daily life, and related domains).
* **Filtering:** Rows whose `instruction` did not start with `Translate` were removed, so every training example belongs to a single task type.
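
The filtering rule can be sketched in a few lines of Python. This is an illustration only: the card does not publish the script that produced the file, so the regular expression and the helper names `is_translate_row` and `keep_translate_rows` are assumptions. Note that a plain prefix match would also accept words such as `Translated`.

```python
import re

# Assumed reading of the rule: optional leading whitespace, then "translate"
# in any letter case. The real filtering script is not part of the card.
_TRANSLATE_PREFIX = re.compile(r"\s*translate", re.IGNORECASE)


def is_translate_row(row: dict) -> bool:
    """Return True when the row's instruction starts with 'Translate'."""
    instruction = row.get("instruction", "")
    if not isinstance(instruction, str):
        return False
    return _TRANSLATE_PREFIX.match(instruction) is not None


def keep_translate_rows(rows):
    """Keep only translate-style rows, preserving their original order."""
    return [row for row in rows if is_translate_row(row)]
```

Applied to the parent file, a filter of this shape would drop every row whose instruction opens with some other verb and leave the translation rows untouched.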
## File description
### `vocab_paite_2025-12-13_translate_only.jsonl`
| Property | Value |
|----------|--------|
| **Lines** | 26,502 |
| **Format** | JSONL (one JSON object per line, UTF-8) |
| **Schema** | `instruction` (string), `input` (string), `output` (string) |
**Example line:**
```json
{"instruction": "Translate The knife is very sharp to Paite", "input": "The knife is very sharp.", "output": "tem a hiam mahmah."}
```
* **`instruction`:** Always begins with `Translate` (after optional leading whitespace).
* **`input`:** English phrase or sentence to translate.
* **`output`:** Paite translation (supervision target).
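
As a minimal sketch of reading the file against the schema above, the following uses only the standard library. The function names `parse_record` and `load_jsonl` are hypothetical, and the strictness (rejecting any record with a missing or non-string field) is a choice made for illustration rather than something the card specifies.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check the three-string schema."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("line is not a JSON object")
    for key in REQUIRED_KEYS:
        if not isinstance(record.get(key), str):
            raise ValueError(f"field {key!r} is missing or not a string")
    return record


def load_jsonl(path: str) -> list[dict]:
    """Load every non-blank line of a UTF-8 JSONL file."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                records.append(parse_record(line))
            except ValueError as error:
                # json.JSONDecodeError is a ValueError subclass, so malformed
                # JSON is reported with its line number as well.
                raise ValueError(f"{path}:{line_number}: {error}") from error
    return records
```

Calling `load_jsonl("vocab_paite_2025-12-13_translate_only.jsonl")` on a local copy should return one dictionary per line of the file.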
## Relationship to the full release
The parent file `vocab_paite_2025-12-13.jsonl` may include non-translate instructions. This **`_translate_only`** file is the subset for **translate-only** SFT. For **CPT** plain-text data from the same project, see **`README_vocab_paite_2025-12-13_paragraph.md`** and `vocab_paite_2025-12-13_paragraph.jsonl`.
## Technical training parameters (SFT)
* **SFT learning rate:** `2e-5` (tune if loss is unstable).
* **LoRA rank (r):** `64` or `128`
* **LoRA alpha (α):** `128` or `256`
* **Context length:** `4096` tokens is typical for Gemma/Unsloth; the examples in this file are short.
* **Packing:** Enable where supported (e.g. Unsloth). Because the examples are short, packing several into one sequence can improve training throughput.
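
For bookkeeping, the values above can be gathered into one plain-Python settings object. This is a sketch, not a trainer configuration: the class `SftSettings` and its field names are hypothetical, and pairing rank `64` with alpha `128` (and `128` with `256`) is an assumption, since the list gives the values without saying which go together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SftSettings:
    """Hypothetical container for the hyperparameters listed above."""

    learning_rate: float = 2e-5
    lora_r: int = 64
    lora_alpha: int = 128
    max_seq_length: int = 4096
    packing: bool = True

    @property
    def lora_scale(self) -> float:
        # In the standard LoRA formulation the update is scaled by alpha / r.
        return self.lora_alpha / self.lora_r


# Assumed pairings: each alpha is twice its rank.
SMALL = SftSettings()
LARGE = SftSettings(lora_r=128, lora_alpha=256)
```

Under that assumed pairing both presets give the same `lora_scale`, so moving from `SMALL` to `LARGE` changes adapter capacity without changing the scaling factor.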
## Usage notes
* **Format:** JSONL — one record per line.
* **Structure:** Each line has `instruction`, `input`, and `output`.
* **Training:** Map `instruction` and `input` to the user turn (the prompt) and train on `output` as the response, following your chat or Alpaca template.
* **License:** MIT (frontmatter); confirm compliance with your base model’s license (e.g. Gemma) before redistribution.
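
The training note above can be sketched as a small formatting helper. The prompt wording below is the commonly used Alpaca template with an input field; whether this project used exactly that wording is an assumption, and `build_example` and its `eos_token` parameter are hypothetical names. Pass your tokenizer's end-of-sequence string as `eos_token` so the model learns where the translation stops.

```python
# Commonly used Alpaca prompt (variant with an input field); assumed here,
# not confirmed by the dataset card.
ALPACA_PROMPT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_example(record: dict, eos_token: str = "") -> dict:
    """Split one record into a prompt and the completion to train on."""
    prompt = ALPACA_PROMPT.format(
        instruction=record["instruction"].strip(),
        input=record["input"].strip(),
    )
    # Only the Paite output (plus EOS) is the supervision target.
    completion = record["output"].strip() + eos_token
    return {"prompt": prompt, "completion": completion}
```

Keeping prompt and completion separate makes it straightforward to mask the prompt tokens from the loss if your trainer supports completion-only training; concatenate the two when it expects a single text field.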
## Citation
Reference this artifact by filename and date: `vocab_paite_2025-12-13_translate_only`.
Provider:
sensix-zo



