
sensix-zo/Continued-Pre-Training-Vocab-Paite

Source: Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/sensix-zo/Continued-Pre-Training-Vocab-Paite
Resource description:
---
license: mit
language:
- pck
pretty_name: Paite Vocabulary — CPT (paragraph text)
task_categories:
- text-generation
tags:
- paite
- continued-pretraining
- cpt
- gemma
- unsloth
size_categories:
- n<1K
---

# Paite Vocabulary — CPT Paragraph Text (`vocab_paite_2025-12-13_paragraph.jsonl`)

This file is **continued pretraining (CPT)** data: long **plain-text** sequences for causal language modeling. There is **no** instruction header, only a `text` field per line, so you can adapt token statistics and bilingual bridging patterns before instruction tuning.

## Dataset composition

* **Construction:** Built from the same Paite vocabulary sentence pairs as the SFT release. Each underlying example uses one of three **random** bridging templates between quoted English and Paite spans (periods **inside** those quoted spans are removed in processing). Sentences are separated with `.`, and the stream is **packed** into chunks of **up to 2048** `cl100k_base` tokens per line.
* **Coverage:** Broad vocabulary and short-sentence domains (daily life, food, travel, emotion, and related topics).

## File description

### `vocab_paite_2025-12-13_paragraph.jsonl`

| Property | Value |
|----------|--------|
| **Lines** | 284 |
| **Format** | JSONL, one object per line (UTF-8) |
| **Schema** | `text` (string) only |

**Example line:**

```json
{"text": "\"The knife is very sharp\"paite pau in\"tem a hiam mahmah\"a kichi hi. ..."}
```

* **JSON escapes:** The `\"` sequences in the file are required so that literal `"` characters can appear inside the JSON string; after `json.loads`, the text uses normal quotes.

## Technical training parameters (CPT)

* **Objective:** causal LM on `text` (standard next-token prediction).
* **Learning rate:** often lower than SFT (e.g. `1e-5`–`5e-5`; validate on loss).
* **Context length:** up to **2048** tokens per line by construction.
* **Packing:** optional; each line is already a long chunk.

## Usage notes

* **Format:** JSONL, one JSON object per line.
* **Loading:** Read each line, call `json.loads(line)["text"]`, and feed the resulting strings to your CPT dataloader.
* **License:** MIT (frontmatter); comply with your **base model** license (e.g. Gemma) for redistribution.

## Citation

Reference this artifact by filename and date: `vocab_paite_2025-12-13_paragraph`.
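The loading step from the usage notes (read each line, `json.loads`, take `text`) can be sketched roughly as below. This is a minimal illustration using only the Python standard library, not the dataset authors' own loader; the function name `iter_texts` and the decision to skip blank lines and raise on a malformed record are assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_texts(path: str) -> Iterator[str]:
    """Yield the `text` string from each non-empty line of a JSONL file.

    `iter_texts` is a hypothetical helper name, not part of the dataset.
    """
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # assumption: tolerate blank lines (e.g. at end of file)
            record = json.loads(line)  # also unescapes \" into plain quotes
            text = record.get("text") if isinstance(record, dict) else None
            if not isinstance(text, str):
                raise ValueError(f"line {line_no}: expected an object with a string `text` field")
            yield text
```

Assuming the file has been downloaded locally, `list(iter_texts("vocab_paite_2025-12-13_paragraph.jsonl"))` should return one string per line, which can then be passed to whatever CPT dataloader you use. A generator keeps memory use flat, though at 284 lines loading everything into a list is equally reasonable.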
Provider: sensix-zo