viniciusxpb/scaffold-tokens-dataset

Name: viniciusxpb/scaffold-tokens-dataset
Creator: viniciusxpb
Published: 2026-03-15 01:27:26
License: not listed on this page (the dataset card declares cc0-1.0)

Source: Hugging Face · Updated 2026-03-15 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/viniciusxpb/scaffold-tokens-dataset

Description:

---
license: cc0-1.0
language:
- pt
tags:
- text-generation
- scaffold-tokens
- countdown-tokens
- length-control
- gpt2
- portuguese
- news
pretty_name: Scaffold Tokens Dataset
size_categories:
- 100M<n<1B
models:
- viniciusxpb/scaffold-gpt2-pt
---

# Scaffold Tokens Dataset

Pre-tokenized Portuguese news articles with **scaffold (countdown) tokens** for training language models with exact length control.

## Resources

| Resource | Link |
|----------|------|
| **Pre-trained model** | [viniciusxpb/scaffold-gpt2-pt](https://huggingface.co/viniciusxpb/scaffold-gpt2-pt) |
| **Training code** | [github.com/viniciusxpb/scaffold-tokens](https://github.com/viniciusxpb/scaffold-tokens) |
| **Author** | [Vinícius França](https://www.linkedin.com/in/vinicius-franca-dev/) |

## What are Scaffold Tokens?

Each word in the text is preceded by a countdown token `<ff_N>` that tells the model how many words remain after it before the end of the document:

```
<ff_6> O <ff_5> presidente <ff_4> anunciou <ff_3> novas <ff_2> medidas <ff_1> econômicas <ff_0> .
```

The model learns to use this signal for **exact length control** and **emergent structural planning**: it starts with introductory language when `ff` is high and shifts to conclusions as `ff` approaches zero.

### Original Text Format (JSON)

Before being converted to binary shards, each article is stored as JSON with the countdown annotations:

```json
{
  "id": 1,
  "content-ff": "<ff_630> Com <ff_629> a <ff_628> possibilidade <ff_627> de <ff_626> uma <ff_625> condenação ... <ff_1> financeiro <ff_0> ."
}
```

During shard creation, each `<ff_N>` is converted to a numeric token ID (`50257 + N`) and each word is encoded with GPT-2 BPE.
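The annotation and ID mapping described above can be sketched as follows. This is a minimal illustration, not the repository's code: `annotate`, `to_token_ids`, and the `encode_word` parameter are hypothetical names, and splitting on whitespace is an assumed word-splitting rule. The only detail taken from the dataset card is the mapping `<ff_N>` → `50257 + N`.

```python
import re

FF_BASE = 50257  # ID of <ff_0>, per the mapping 50257 + N
FF_PATTERN = re.compile(r"<ff_(\d+)>")


def annotate(text):
    """Prefix each whitespace-separated word with its countdown token.

    Assumption: words are split on whitespace; the dataset's actual
    word-splitting rule may differ.
    """
    words = text.split()
    last = len(words) - 1
    return " ".join(f"<ff_{last - i}> {w}" for i, w in enumerate(words))


def to_token_ids(annotated, encode_word):
    """Map `<ff_N>` to FF_BASE + N and every other word to BPE IDs.

    `encode_word` is a caller-supplied function (word -> list of ints),
    a hypothetical hook standing in for a GPT-2 BPE encoder.
    """
    ids = []
    for piece in annotated.split():
        match = FF_PATTERN.fullmatch(piece)
        if match:
            ids.append(FF_BASE + int(match.group(1)))
        else:
            ids.extend(encode_word(piece))
    return ids
```

To produce real GPT-2 IDs, `encode_word` could be `tiktoken.get_encoding("gpt2").encode`; whether the original pipeline encodes each word with a leading space is not stated in the card, so treat that detail as open.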

## Dataset Details

| Field | Value |
|-------|-------|
| **Source** | [Folha de S.Paulo news articles](https://www.kaggle.com/datasets/marlesson/news-of-the-site-folhauol) (public domain) |
| **Language** | pt-BR |
| **Total tokens** | ~208M |
| **Tokenizer** | tiktoken GPT-2 (50,257 BPE tokens) |
| **Format** | Binary shards (uint16) |
| **License** | CC0 (Public Domain) |

## Files

```
data/
├── train/
│   ├── shard_000.bin  (191 MB, 100M tokens)
│   └── shard_001.bin  (186 MB, 97M tokens)
└── val/
    └── shard_000.bin  (20 MB, 10M tokens)
```

## Shard Format

Each `.bin` file consists of a 1,024-byte header followed by a flat array of `uint16` token IDs:

- **Header:** 256 × int32 values
  - `[0]` = magic number (`20240520`)
  - `[1]` = version (`1`)
  - `[2]` = token count
- **Body:** uint16 token IDs

### Token Layout

```
IDs 0–50255:     Standard GPT-2 BPE tokens (tiktoken)
ID  50256:       EOT (end of text, separates documents)
IDs 50257–51256: <ff_0> through <ff_999> (countdown tokens)
```

### Pattern within documents

```
[ff_id] [bpe_tok...] [ff_id] [bpe_tok...] ... [EOT]
```

Each `ff` token precedes one word. Because Portuguese words often split into multiple BPE subwords, the spacing between `ff` tokens varies (1–4+ BPE tokens per word).
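The token layout and document pattern above suggest a simple sanity check on a token stream. The sketch below is illustrative only: `split_documents`, `countdown_values`, and `is_well_formed` are hypothetical helper names, and the check assumes every document counts down by exactly one per word to `<ff_0>`, as in the examples in this card.

```python
EOT = 50256       # end-of-text separator
FF_BASE = 50257   # <ff_0>
FF_MAX = 51256    # <ff_999>


def split_documents(tokens):
    """Split a flat token stream into documents at each EOT."""
    docs, current = [], []
    for t in tokens:
        t = int(t)  # also accepts NumPy integer scalars
        if t == EOT:
            docs.append(current)
            current = []
        else:
            current.append(t)
    if current:  # trailing document without a final EOT
        docs.append(current)
    return docs


def countdown_values(doc):
    """Return the N of every <ff_N> token in a document, in order."""
    return [t - FF_BASE for t in doc if FF_BASE <= t <= FF_MAX]


def is_well_formed(doc):
    """Check that a document starts with a countdown token and that the
    countdown decreases by one per word down to <ff_0> (assumed rule)."""
    values = countdown_values(doc)
    if not values or not (FF_BASE <= doc[0] <= FF_MAX):
        return False
    return values == list(range(values[0], -1, -1))
```

Applied to the body of a shard, `split_documents` yields one token list per article, and the first value returned by `countdown_values` gives that article's word count minus one.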

## How to Load

```python
import numpy as np

def load_shard(path):
    # Header: 256 int32 values (1,024 bytes)
    header = np.fromfile(path, dtype=np.int32, count=256)
    assert header[0] == 20240520, "Invalid magic number"
    token_count = header[2]
    # Body: uint16 token IDs, starting right after the header
    tokens = np.fromfile(path, dtype=np.uint16, offset=1024)
    assert len(tokens) == token_count
    return tokens
```

## How to Decode

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
FF_BASE = 50257

def decode_shard(tokens):
    words = []
    current_word_tokens = []
    for t in tokens:
        t = int(t)  # convert NumPy scalars to plain ints for tiktoken
        if FF_BASE <= t <= 51256:  # countdown token: flush the previous word
            if current_word_tokens:
                words.append(enc.decode(current_word_tokens))
                current_word_tokens = []
        elif t == 50256:  # EOT
            if current_word_tokens:
                words.append(enc.decode(current_word_tokens))
                current_word_tokens = []
            words.append("\n\n")
        else:
            current_word_tokens.append(t)
    if current_word_tokens:  # flush a trailing word not followed by EOT
        words.append(enc.decode(current_word_tokens))
    return " ".join(words)
```

## Quick Start

```bash
# Clone the training repo
git clone https://github.com/viniciusxpb/scaffold-tokens
cd scaffold-tokens

# Setup and download
make setup
make download

# Validate and train
make validate
make train
```

## Citation

```bibtex
@misc{scaffold-tokens-2025,
  title={Scaffold Tokens: Teaching LLMs to Plan with Countdown Tokens},
  author={Vinícius França},
  year={2025},
  url={https://github.com/viniciusxpb/scaffold-tokens}
}
```

Provider: viniciusxpb
