sensix-zo/SFT-Paite_Translation

Name: sensix-zo/SFT-Paite_Translation
Creator: sensix-zo
Published: 2026-04-10 09:50:14
License: MIT (per the dataset card frontmatter)

Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/sensix-zo/SFT-Paite_Translation

Description:

---
license: mit
language:
- pck
pretty_name: Paite Vocabulary — SFT Translate-Only (Instruction / Input / Output)
task_categories:
- text-generation
tags:
- paite
- instruction-finetuning
- sft
- gemma
- unsloth
- alpaca
- translation
size_categories:
- 10K<n<100K
---

# Paite Vocabulary: SFT Translate-Only (`vocab_paite_2025-12-13_translate_only.jsonl`)

This dataset is a **filtered** slice of the Paite vocabulary instruction set for **supervised fine-tuning (SFT)**. Only rows whose **`instruction`** starts with **`Translate`** (case-insensitive) are kept, so every example is an English-to-Paite translation task. The classic **Instruction–Input–Output** structure matches Alpaca-style training on Gemma, Unsloth, and similar stacks.

## Dataset composition

* **Task focus:** English appears in **`input`**; Paite is the target in **`output`**. **`instruction`** frames the task (e.g. `Translate … to Paite`).
* **Coverage:** Broad vocabulary and short-sentence patterns (kitchen, travel, emotion, daily life, and related domains).
* **Filtering:** Rows whose `instruction` did not start with `Translate` were removed so supervision stays a single task type.

## File description

### `vocab_paite_2025-12-13_translate_only.jsonl`

| Property | Value |
|----------|--------|
| **Lines** | 26,502 |
| **Format** | JSONL (one JSON object per line, UTF-8) |
| **Schema** | `instruction` (string), `input` (string), `output` (string) |

**Example line:**

```json
{"instruction": "Translate The knife is very sharp to Paite", "input": "The knife is very sharp.", "output": "tem a hiam mahmah."}
```

* **`instruction`:** Always begins with `Translate` (after optional leading whitespace).
* **`input`:** English phrase or sentence to translate.
* **`output`:** Paite translation (supervision target).

## Relationship to the full release

The parent file `vocab_paite_2025-12-13.jsonl` may include non-translate instructions. This **`_translate_only`** file is the subset for **translate-only** SFT.
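
The filtering rule that relates the parent file to this subset can be sketched in a few lines of Python. This is a minimal illustration of the stated rule (keep rows whose `instruction` starts with `Translate`, case-insensitive, after optional leading whitespace), not the project's actual script. The helper names `is_translate_row` and `filter_translate_only` are hypothetical, and the second sample row is an invented non-translate record with a placeholder output, included only to show a rejection.

```python
import json


def is_translate_row(row: dict) -> bool:
    """True when `instruction` starts with 'Translate' (case-insensitive),
    ignoring leading whitespace."""
    instruction = row.get("instruction", "")
    if not isinstance(instruction, str):
        return False
    return instruction.lstrip().lower().startswith("translate")


def filter_translate_only(lines):
    """Yield parsed rows from an iterable of JSONL lines, keeping only
    translate rows. Blank lines are skipped."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if is_translate_row(row):
            yield row


# First row is the example line from the dataset card; the second is a
# hypothetical non-translate row (placeholder output) that should be dropped.
sample = [
    '{"instruction": "Translate The knife is very sharp to Paite", '
    '"input": "The knife is very sharp.", "output": "tem a hiam mahmah."}',
    '{"instruction": "Define the word", "input": "knife", '
    '"output": "(placeholder)"}',
]

kept = list(filter_translate_only(sample))
print(len(kept), kept[0]["output"])
```

To apply the same check to a file on disk, pass an open file handle (for example `open(path, encoding="utf-8")`) as `lines`, since iterating a text file yields one line at a time.
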

For **CPT** plain-text data from the same project, see **`README_vocab_paite_2025-12-13_paragraph.md`** and `vocab_paite_2025-12-13_paragraph.jsonl`.

## Technical training parameters (SFT)

* **SFT learning rate:** `2e-5` (tune if loss is unstable).
* **LoRA rank (r):** `64` or `128`
* **LoRA alpha (α):** `128` or `256`
* **Context length:** `4096` tokens is typical for Gemma/Unsloth (examples are short; packing may help throughput).
* **Packing:** Enable where supported (e.g. Unsloth) for faster training.

## Usage notes

* **Format:** JSONL, one record per line.
* **Structure:** Each line has `instruction`, `input`, and `output`.
* **Training:** Map `instruction` + `input` to the user/prompt and train on `output` per your chat or Alpaca template.
* **License:** MIT (frontmatter); confirm compliance with your base model's license (e.g. Gemma) before redistribution.

## Citation

Reference this artifact by filename and date: `vocab_paite_2025-12-13_translate_only`.
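
As a sketch of the mapping described in the usage notes (`instruction` + `input` form the prompt, `output` is the training target), the snippet below formats one record with the widely used Alpaca prompt layout. The dataset card does not prescribe an exact template, so the template text, the helper name `to_prompt_and_target`, and the `<eos>` end-of-sequence string are assumptions; in practice the EOS token should come from your tokenizer.

```python
# Alpaca-style prompt layout (an assumption; the card only says "Alpaca template").
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def to_prompt_and_target(row: dict, eos_token: str = "") -> tuple[str, str]:
    """Split one record into (prompt, target).

    The prompt carries `instruction` and `input`; the target is `output`
    plus an optional end-of-sequence marker so the model learns to stop.
    """
    prompt = ALPACA_TEMPLATE.format(
        instruction=row["instruction"], input=row["input"]
    )
    target = row["output"] + eos_token
    return prompt, target


# Example record taken from the dataset card.
row = {
    "instruction": "Translate The knife is very sharp to Paite",
    "input": "The knife is very sharp.",
    "output": "tem a hiam mahmah.",
}

# "<eos>" is a hypothetical placeholder; use your tokenizer's EOS token.
prompt, target = to_prompt_and_target(row, eos_token="<eos>")
print(prompt + target)
```

Keeping the prompt and target separate makes it straightforward to mask the prompt tokens from the loss if your training stack supports response-only supervision; concatenating them gives the plain full-sequence variant.
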

Provider:

sensix-zo
