sensix-zo/SFT-Paite_Translation

Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/sensix-zo/SFT-Paite_Translation
Resource description:
---
license: mit
language:
- pck
pretty_name: Paite Vocabulary — SFT Translate-Only (Instruction / Input / Output)
task_categories:
- text-generation
tags:
- paite
- instruction-finetuning
- sft
- gemma
- unsloth
- alpaca
- translation
size_categories:
- 10K<n<100K
---

# Paite Vocabulary — SFT Translate-Only (`vocab_paite_2025-12-13_translate_only.jsonl`)

This dataset is a **filtered** slice of the Paite vocabulary instruction set for **supervised fine-tuning (SFT)**. Only rows whose **`instruction`** starts with **`Translate`** (case-insensitive) are kept, so every example is an English-to-Paite translation task. The classic **Instruction–Input–Output** structure matches Alpaca-style training on Gemma, Unsloth, and similar stacks.

## Dataset composition

* **Task focus:** English appears in **`input`**; Paite is the target in **`output`**. **`instruction`** frames the task (e.g. `Translate … to Paite`).
* **Coverage:** Broad vocabulary and short-sentence patterns (kitchen, travel, emotion, daily life, and related domains).
* **Filtering:** Rows whose `instruction` did not start with `Translate` were removed so supervision stays a single task type.

## File description

### `vocab_paite_2025-12-13_translate_only.jsonl`

| Property | Value |
|----------|--------|
| **Lines** | 26,502 |
| **Format** | JSONL (one JSON object per line, UTF-8) |
| **Schema** | `instruction` (string), `input` (string), `output` (string) |

**Example line:**

```json
{"instruction": "Translate The knife is very sharp to Paite", "input": "The knife is very sharp.", "output": "tem a hiam mahmah."}
```

* **`instruction`:** Always begins with `Translate` (after optional leading whitespace).
* **`input`:** English phrase or sentence to translate.
* **`output`:** Paite translation (supervision target).

## Relationship to the full release

The parent file `vocab_paite_2025-12-13.jsonl` may include non-translate instructions. This **`_translate_only`** file is the subset for **translate-only** SFT.
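
The filtering rule described above (keep a row only when `instruction` starts with `Translate`, ignoring case and leading whitespace) could be reproduced roughly as follows. This is a minimal sketch, not the project's actual filter script, which is not shown in this card. The helper names `is_translate_row` and `filter_translate_only` are hypothetical, and the second sample row is invented purely to show a rejected record.

```python
import json


def is_translate_row(row: dict) -> bool:
    """Return True when `instruction` starts with 'Translate'.

    The match ignores case and leading whitespace, mirroring the rule
    stated in the card. (Hypothetical helper, not part of the dataset.)
    """
    instruction = row.get("instruction", "")
    if not isinstance(instruction, str):
        return False
    return instruction.lstrip().lower().startswith("translate")


def filter_translate_only(lines):
    """Yield parsed JSONL rows that pass is_translate_row; skip blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if is_translate_row(row):
            yield row


# First row is the example line from the card; the second is a made-up
# non-translate row used only to demonstrate the filter.
sample = [
    '{"instruction": "Translate The knife is very sharp to Paite", '
    '"input": "The knife is very sharp.", "output": "tem a hiam mahmah."}',
    '{"instruction": "Define the word", "input": "knife", "output": "<placeholder>"}',
]

kept = list(filter_translate_only(sample))
print(len(kept))
```

To run the same check over the parent file, pass an open file handle (for example `open("vocab_paite_2025-12-13.jsonl", encoding="utf-8")`) in place of `sample`, since `filter_translate_only` accepts any iterable of lines.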
For **CPT** plain-text data from the same project, see **`README_vocab_paite_2025-12-13_paragraph.md`** and `vocab_paite_2025-12-13_paragraph.jsonl`.

## Technical training parameters (SFT)

* **SFT learning rate:** `2e-5` (tune if loss is unstable).
* **LoRA rank (r):** `64` or `128`
* **LoRA alpha (α):** `128` or `256`
* **Context length:** `4096` tokens is typical for Gemma/Unsloth (examples are short; packing may help throughput).
* **Packing:** Enable where supported (e.g. Unsloth) for faster training.

## Usage notes

* **Format:** JSONL, one record per line.
* **Structure:** Each line has `instruction`, `input`, and `output`.
* **Training:** Map `instruction` + `input` to the user/prompt and train on `output` per your chat or Alpaca template.
* **License:** MIT (frontmatter); confirm compliance with your base model’s license (e.g. Gemma) before redistribution.

## Citation

Reference this artifact by filename and date: `vocab_paite_2025-12-13_translate_only`.
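
The mapping in the usage notes (`instruction` + `input` form the prompt, `output` is the training target) might look like the sketch below. The card does not say which template was used, so the widely used Alpaca prompt wording here is an assumption. `parse_record`, `to_prompt_and_target`, and the `eos_token` parameter are hypothetical names introduced for illustration.

```python
import json

# Commonly used Alpaca prompt with an input field. Assumed here; the
# dataset card does not specify an exact template.
ALPACA_PROMPT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

REQUIRED_KEYS = ("instruction", "input", "output")


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check that the three schema fields are strings."""
    row = json.loads(line)
    for key in REQUIRED_KEYS:
        if not isinstance(row.get(key), str):
            raise ValueError(f"missing or non-string field: {key}")
    return row


def to_prompt_and_target(row: dict, eos_token: str = "") -> tuple[str, str]:
    """Split a record into (prompt, target).

    The prompt carries `instruction` and `input`; the target is `output`
    plus an optional end-of-sequence token supplied by the caller's tokenizer.
    """
    prompt = ALPACA_PROMPT.format(
        instruction=row["instruction"], input=row["input"]
    )
    return prompt, row["output"] + eos_token


# Example line taken from the card.
line = (
    '{"instruction": "Translate The knife is very sharp to Paite", '
    '"input": "The knife is very sharp.", "output": "tem a hiam mahmah."}'
)
prompt, target = to_prompt_and_target(parse_record(line))
print(prompt + target)
```

Keeping the prompt and target separate makes it straightforward to mask the prompt tokens and compute loss only on `output`, if the training stack supports that. Stacks that train on the full concatenated text can simply join the two strings.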
Provider:
sensix-zo