viniciusxpb/scaffold-tokens-dataset
Hugging Face | Updated 2026-03-15 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/viniciusxpb/scaffold-tokens-dataset
Description:
---
license: cc0-1.0
language:
- pt
tags:
- text-generation
- scaffold-tokens
- countdown-tokens
- length-control
- gpt2
- portuguese
- news
pretty_name: Scaffold Tokens Dataset
size_categories:
- 100M<n<1B
models:
- viniciusxpb/scaffold-gpt2-pt
---
# Scaffold Tokens Dataset
Pre-tokenized Portuguese news articles annotated with **scaffold (countdown) tokens**, for training language models with exact length control.
## Resources
| Resource | Link |
|----------|------|
| **Pre-trained model** | [viniciusxpb/scaffold-gpt2-pt](https://huggingface.co/viniciusxpb/scaffold-gpt2-pt) |
| **Training code** | [github.com/viniciusxpb/scaffold-tokens](https://github.com/viniciusxpb/scaffold-tokens) |
| **Author** | [Vinícius França](https://www.linkedin.com/in/vinicius-franca-dev/) |
## What are Scaffold Tokens?
Each word in the text is preceded by a countdown token `<ff_N>`, where `N` is the number of words that still follow it before the end of the document. Punctuation marks count as words:
```
<ff_6> O <ff_5> presidente <ff_4> anunciou <ff_3> novas <ff_2> medidas <ff_1> econômicas <ff_0> .
```
The model learns to use this signal for **exact length control** and **emergent structural planning**: it opens with introductory language when `ff` is high and shifts to conclusions as `ff` approaches zero.
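The annotation scheme can be sketched in a few lines. This is a minimal illustration that assumes the text has already been split into words, with punctuation as its own word as in the example above; the function name `add_countdown` is hypothetical and not taken from the training repository.

```python
def add_countdown(words):
    """Prefix each word with <ff_N>, where N is the number of words that follow it."""
    total = len(words)
    pieces = []
    for i, word in enumerate(words):
        remaining = total - 1 - i  # words left after this one
        pieces.append(f"<ff_{remaining}> {word}")
    return " ".join(pieces)


example = add_countdown(
    ["O", "presidente", "anunciou", "novas", "medidas", "econômicas", "."]
)
print(example)
```

Run on the seven-word sentence, this reproduces the annotated example shown above.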
### Original Text Format (JSON)
Before being converted to binary shards, each article is stored as JSON with the countdown annotations:
```json
{
  "id": 1,
  "content-ff": "<ff_630> Com <ff_629> a <ff_628> possibilidade <ff_627> de <ff_626> uma <ff_625> condenação ... <ff_1> financeiro <ff_0> ."
}
```
During shard creation, each `<ff_N>` is converted to a numeric token ID (`50257 + N`) and each word is encoded with GPT-2 BPE.
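The conversion step can be sketched as follows. The sketch assumes the `content-ff` string alternates between `<ff_N>` markers and whitespace-free words. `encode_annotated` and `encode_word` are hypothetical names, and how the original pipeline feeds each word to the BPE encoder (for example, with or without a leading space) is not specified here, so the encoder is passed in as a parameter.

```python
import re

FF_BASE = 50257  # ID of <ff_0>; <ff_N> maps to FF_BASE + N
FF_PATTERN = re.compile(r"<ff_(\d+)>")


def encode_annotated(text, encode_word):
    """Turn an annotated string into a flat list of token IDs.

    encode_word is any callable mapping a word to a list of BPE IDs
    (a hypothetical hook; a GPT-2 encoder's encode method would fit).
    """
    ids = []
    for piece in text.split():
        match = FF_PATTERN.fullmatch(piece)
        if match:
            ids.append(FF_BASE + int(match.group(1)))  # countdown token
        else:
            ids.extend(encode_word(piece))  # regular word -> BPE IDs
    return ids


# Toy encoder for demonstration only: one fake ID per character.
def toy_encode(word):
    return [ord(c) % 256 for c in word]


print(encode_annotated("<ff_1> oi <ff_0> .", toy_encode))
```

With tiktoken installed, `tiktoken.get_encoding("gpt2").encode` could be passed as `encode_word` in place of the toy encoder.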
## Dataset Details
| Field | Value |
|-------|-------|
| **Source** | [Folha de S.Paulo news articles](https://www.kaggle.com/datasets/marlesson/news-of-the-site-folhauol) (public domain) |
| **Language** | pt-BR |
| **Total tokens** | ~208M |
| **Tokenizer** | tiktoken GPT-2 (50,257 BPE tokens) |
| **Format** | Binary shards (uint16) |
| **License** | CC0 (Public Domain) |
## Files
```
data/
├── train/
│   ├── shard_000.bin   (191 MB, 100M tokens)
│   └── shard_001.bin   (186 MB, 97M tokens)
└── val/
    └── shard_000.bin   (20 MB, 10M tokens)
```
## Shard Format
Each `.bin` file consists of a fixed-size header followed by a flat array of `uint16` token IDs:
- **Header:** 256 × int32 values (1024 bytes)
  - `[0]` = magic number (`20240520`)
  - `[1]` = version (`1`)
  - `[2]` = token count
- **Body:** uint16 token IDs
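
To make the layout concrete, here is a minimal sketch that writes and re-reads a tiny shard using only the standard library. It assumes little-endian byte order and that unused header slots are zero; neither is stated above. `write_shard` and `read_header` are hypothetical helper names.

```python
import os
import struct
import tempfile

MAGIC = 20240520
VERSION = 1
HEADER_INTS = 256
HEADER_BYTES = HEADER_INTS * 4  # 256 int32 values = 1024 bytes


def write_shard(path, tokens):
    """Write a header of 256 int32 values followed by uint16 token IDs."""
    header = [0] * HEADER_INTS  # assumption: unused slots are zero
    header[0], header[1], header[2] = MAGIC, VERSION, len(tokens)
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{HEADER_INTS}i", *header))
        f.write(struct.pack(f"<{len(tokens)}H", *tokens))


def read_header(path):
    """Return (magic, version, token_count) from the first 12 header bytes."""
    with open(path, "rb") as f:
        return struct.unpack("<3i", f.read(12))


path = os.path.join(tempfile.mkdtemp(), "shard_demo.bin")
write_shard(path, [50263, 46, 50256])
magic, version, count = read_header(path)
# The file size should equal the header plus two bytes per token.
assert os.path.getsize(path) == HEADER_BYTES + 2 * count
print(magic, version, count)
```

By the same arithmetic, a shard of exactly 100,000,000 tokens would occupy 200,001,024 bytes (about 190.7 MiB).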
### Token Layout
```
IDs 0–50255:      Standard GPT-2 BPE tokens (tiktoken)
ID  50256:        EOT (end of text, separates documents)
IDs 50257–51256:  <ff_0> through <ff_999> (countdown tokens)
```
### Pattern within documents
```
[ff_id] [bpe_tok...] [ff_id] [bpe_tok...] ... [EOT]
```
Each `ff` token precedes one word. Since Portuguese words often split into multiple BPE subwords, the spacing between `ff` tokens is variable (1–4+ BPE tokens per word).
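The pattern can be walked programmatically. The following sketch groups a flat token sequence into documents of `(countdown, bpe_ids)` pairs using the ID ranges from the token layout; `split_documents` is a hypothetical name and the BPE IDs in the demo are made up.

```python
FF_BASE = 50257  # <ff_0>
FF_MAX = 51256   # <ff_999>
EOT = 50256      # document separator


def split_documents(tokens):
    """Group tokens into documents, each a list of (countdown, bpe_ids) pairs."""
    docs, current = [], []
    for t in tokens:
        t = int(t)  # also accepts NumPy integer scalars
        if t == EOT:
            docs.append(current)  # close the current document
            current = []
        elif FF_BASE <= t <= FF_MAX:
            current.append((t - FF_BASE, []))  # start a new word
        elif current:
            current[-1][1].append(t)  # BPE piece of the current word
        # BPE tokens that appear before any countdown token are ignored
    if current:
        docs.append(current)  # trailing document without a final EOT
    return docs


demo = [50258, 10, 11, 50257, 13, 50256, 50257, 7, 50256]
print(split_documents(demo))
```

Each pair exposes how many BPE pieces a word was split into, which makes the variable spacing between `ff` tokens easy to inspect.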
## How to Load
```python
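import numpy as np


def load_shard(path):
    # Header: 256 int32 values (1024 bytes); [0] = magic, [1] = version, [2] = token count
    header = np.fromfile(path, dtype=np.int32, count=256)
    assert header[0] == 20240520, "Invalid magic number"
    token_count = int(header[2])
    # Body: uint16 token IDs starting right after the header
    tokens = np.fromfile(path, dtype=np.uint16, offset=1024)
    assert len(tokens) == token_count, "Token count does not match header"
    return tokens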
```
## How to Decode
```python
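import tiktoken

enc = tiktoken.get_encoding("gpt2")
FF_BASE = 50257  # ID of <ff_0>
FF_MAX = 51256   # ID of <ff_999>
EOT = 50256


def decode_shard(tokens):
    """Strip the countdown tokens and return the plain text of a shard."""
    words = []
    current_word_tokens = []

    def flush():
        # Decode the BPE pieces collected for the current word, if any
        if current_word_tokens:
            words.append(enc.decode(current_word_tokens))
            current_word_tokens.clear()

    for t in tokens:
        t = int(t)  # convert NumPy scalars to plain ints
        if FF_BASE <= t <= FF_MAX:
            flush()  # a countdown token starts a new word
        elif t == EOT:
            flush()
            words.append("\n\n")  # blank line between documents
        else:
            current_word_tokens.append(t)
    flush()  # last word if the shard does not end with EOT
    return " ".join(words)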
```
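Before decoding, it can be useful to check that the countdown values in a shard are consistent. The sketch below is an assumption about what such a check might look like (it is not the repository's `make validate`). It verifies that within each document the `ff` values decrease by one and reach zero before `EOT`. The name `countdown_is_consistent` is hypothetical, and the check would need adjusting if documents longer than 1000 words are handled in a way not described here.

```python
FF_BASE, FF_MAX, EOT = 50257, 51256, 50256


def countdown_is_consistent(tokens):
    """Return True if every document counts down by one and ends at <ff_0>."""
    expected = None  # countdown value expected next; None at document start
    for t in tokens:
        t = int(t)
        if FF_BASE <= t <= FF_MAX:
            n = t - FF_BASE
            if expected is not None and n != expected:
                return False  # countdown skipped or repeated a value
            expected = n - 1
        elif t == EOT:
            if expected not in (None, -1):
                return False  # document ended before reaching <ff_0>
            expected = None
    return True


# Made-up BPE IDs (5, 6, 7, 8) stand in for real word pieces.
good = [50259, 5, 50258, 6, 7, 50257, 8, 50256]
bad = [50259, 5, 50257, 8, 50256]
print(countdown_is_consistent(good), countdown_is_consistent(bad))
```

A final document that is cut off without an `EOT` is not flagged by this sketch.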
## Quick Start
```bash
# Clone the training repo
git clone https://github.com/viniciusxpb/scaffold-tokens
cd scaffold-tokens
# Setup and download
make setup
make download
# Validate and train
make validate
make train
```
## Citation
```bibtex
@misc{scaffold-tokens-2025,
title={Scaffold Tokens: Teaching LLMs to Plan with Countdown Tokens},
author={Vinícius França},
year={2025},
url={https://github.com/viniciusxpb/scaffold-tokens}
}
```
Provided by:
viniciusxpb



