sensix-zo/Continued-Pre-Training-Vocab-Paite
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/sensix-zo/Continued-Pre-Training-Vocab-Paite
Resource description:
---
license: mit
language:
- pck
pretty_name: Paite Vocabulary — CPT (paragraph text)
task_categories:
- text-generation
tags:
- paite
- continued-pretraining
- cpt
- gemma
- unsloth
size_categories:
- n<1K
---
# Paite Vocabulary — CPT Paragraph Text (`vocab_paite_2025-12-13_paragraph.jsonl`)
This file contains **continued pretraining (CPT)** data: long **plain-text** sequences for causal language modeling. There is **no** instruction header, only a `text` field per line, so a model can adapt to Paite token statistics and bilingual bridging patterns before instruction tuning.
## Dataset composition
* **Construction:** Built from the same Paite vocabulary sentence pairs as the SFT release. Each underlying example uses one of three bridging templates, chosen at **random**, to join a quoted English span and a quoted Paite span (periods **inside** those quoted spans are removed during processing). Sentences are separated with `.`, and the resulting stream is **packed** into chunks of **up to 2048** `cl100k_base` tokens, one chunk per line.
* **Coverage:** Broad vocabulary and short-sentence domains (daily life, food, travel, emotion, and related topics).
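
The packing step described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `pack_sentences` and `whitespace_count` are hypothetical, the greedy strategy is an assumption, and the default counter splits on whitespace as a stand-in for a real `cl100k_base` tokenizer. Joining sentences with `. ` is likewise an assumption based on the example line.

```python
from typing import Callable, Iterable, List


def whitespace_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


def pack_sentences(
    sentences: Iterable[str],
    max_tokens: int = 2048,
    count_tokens: Callable[[str], int] = whitespace_count,
) -> List[str]:
    """Greedily pack sentences into chunks of at most max_tokens tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for sentence in sentences:
        # Normalize each sentence to end with exactly one period.
        piece = sentence.strip().rstrip(".") + "."
        n = count_tokens(piece)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(piece)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Summing per-sentence counts only approximates the token count of the joined chunk under a subword tokenizer, and a single sentence longer than `max_tokens` still becomes its own over-budget chunk. A production packer would need to handle both cases.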
## File description
### `vocab_paite_2025-12-13_paragraph.jsonl`
| Property | Value |
|----------|--------|
| **Lines** | 284 |
| **Format** | JSONL — one object per line (UTF-8) |
| **Schema** | `text` (string) only |
**Example line:**
```json
{"text": "\"The knife is very sharp\"paite pau in\"tem a hiam mahmah\"a kichi hi. ..."}
```
* **JSON escapes:** The `\"` sequences in the file are required so that literal `"` characters can appear inside the JSON string; after `json.loads`, the text contains plain quotes.
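
As a quick check of the escaping behavior, the sketch below parses the example line with Python's standard `json` module. The variable names are illustrative only, and the trailing `...` is kept exactly as it appears in the example.

```python
import json

# The example line from above, kept verbatim (raw string so the
# backslashes reach the JSON parser unchanged).
raw_line = r'{"text": "\"The knife is very sharp\"paite pau in\"tem a hiam mahmah\"a kichi hi. ..."}'

record = json.loads(raw_line)
text = record["text"]

# After parsing, the backslashes are gone and only plain quotes remain.
print(text)
```

Writing the record back with `json.dumps` restores the `\"` escapes, so no manual quoting is needed in either direction.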
## Technical training parameters (CPT)
* **Objective:** causal LM on `text` (standard next-token prediction).
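* **Learning rate:** often lower than for SFT (e.g. `1e-5` to `5e-5`; confirm by monitoring loss).
* **Context length:** up to **2048** tokens per line by construction, measured in `cl100k_base` tokens; counts under your base model's tokenizer may differ.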
* **Packing:** optional; each line is already a long chunk.
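
Because the 2048-token cap is defined in `cl100k_base` tokens, it may be worth checking line lengths under the tokenizer you actually train with. The sketch below is one way to do that. `find_overlong` is a hypothetical helper, and `count_tokens` is any callable you supply (for example, a wrapper around your model tokenizer's encode method).

```python
from typing import Callable, Iterable, List, Tuple


def find_overlong(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    max_len: int = 2048,
) -> List[Tuple[int, int]]:
    """Return (line_index, token_count) for every text longer than max_len."""
    overlong: List[Tuple[int, int]] = []
    for index, text in enumerate(texts):
        n = count_tokens(text)
        if n > max_len:
            overlong.append((index, n))
    return overlong
```

If this returns any entries for your tokenizer, those lines would need truncation or re-chunking before training at a 2048-token context.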
## Usage notes
* **Format:** JSONL — one JSON object per line.
* **Loading:** Read each line, extract `json.loads(line)["text"]`, and feed the resulting strings to your CPT dataloader.
* **License:** MIT (per the frontmatter); also comply with your **base model**'s license (e.g. Gemma) when redistributing.
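
The loading step can be sketched with the standard library alone. `load_cpt_texts` is a hypothetical helper name, and skipping blank lines is an added convenience rather than something the file is stated to need.

```python
import json
from pathlib import Path
from typing import List, Union


def load_cpt_texts(path: Union[str, Path]) -> List[str]:
    """Read a JSONL file and return the `text` field of each record."""
    texts: List[str] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            texts.append(json.loads(line)["text"])
    return texts
```

Calling `load_cpt_texts("vocab_paite_2025-12-13_paragraph.jsonl")` on a local copy should yield one string per line, which would be 284 strings if the file matches the table above.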
## Citation
Reference this artifact by filename and date: `vocab_paite_2025-12-13_paragraph`.
Provider:
sensix-zo



