8Planetterraforming/Parameter-Golf-V9-FineWeb-Entropy-Selective-MicroMix
Source: Hugging Face. Updated 2026-04-20. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/8Planetterraforming/Parameter-Golf-V9-FineWeb-Entropy-Selective-MicroMix
Resource description:
---
license: mit
language:
- en
tags:
- parameter-golf
- fineweb
- bpb
- compression
- web-filtering
- entropy-reduction
- micro-mix
- bpe
size_categories:
- 1K<n<10K
pretty_name: Parameter Golf V9 FineWeb Entropy Selective MicroMix
---
# Parameter-Golf-V9-FineWeb-Entropy-Selective-MicroMix
## Short description
A tiny auxiliary micro-mix dataset for OpenAI Parameter Golf experiments, focused on FineWeb-like web signal extraction, boilerplate suppression, compact state continuity, BPE-safe plain text, and entropy-selective training probes.
## Extended description
`Parameter-Golf-V9-FineWeb-Entropy-Selective-MicroMix` was designed after the V7/V8 experiments showed that larger synthetic or instruction-heavy auxiliary data can worsen FineWeb BPB.
V9 is intentionally smaller, cleaner, and more FineWeb-like.
The dataset combines the strongest usable ideas from the 8Planetterraforming dataset family:
1. **V6-style web signal filtering**
Keep useful page content while suppressing cookie banners, repeated navigation, tracking text, newsletter prompts, advertisements, and low-value footer chrome.
2. **V5/V5x-style compact exactness**
Preserve exact values, thresholds, names, dates, seeds, artifact limits, short state updates, and missing-evidence boundaries without adding invented fields.
3. **Presentation vs payload guardrails**
Treat font size, layout, theme, and rendering metadata as presentation. Keep the semantic text payload compact and separate from visual styling.
4. **BPE-safe plain text**
Avoid raw JSON, chat-format examples, markdown tables, long URLs, rare Unicode, and decorative symbols in the training substrate.
5. **Entropy-selective corpus design**
Prefer short natural web paragraphs with common punctuation, stable units, and ordinary web vocabulary.
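The filtering ideas in the list above can be sketched as a small line-level filter. This is a minimal illustration, not the dataset's actual build logic: the phrase list, the 40-character URL cutoff, the "repeated line means page chrome" heuristic, and the ASCII-only check (a crude stand-in for "no rare Unicode") are all assumptions chosen for the example.

```python
import re
from collections import Counter

# Hypothetical phrase list; the real filter rules are not documented here.
BOILERPLATE_HINTS = (
    "accept cookies",
    "subscribe to our newsletter",
    "all rights reserved",
    "privacy policy",
)
# Assumed cutoff: a URL with 40+ characters after the scheme counts as "long".
LONG_URL = re.compile(r"https?://\S{40,}")


def is_boilerplate(line: str) -> bool:
    """Return True if the line looks like page chrome rather than content."""
    lowered = line.lower()
    return any(hint in lowered for hint in BOILERPLATE_HINTS)


def is_bpe_safe(line: str) -> bool:
    """Reject JSON-like lines, markdown table rows, long URLs, and non-ASCII text."""
    stripped = line.strip()
    if stripped.startswith(("{", "[", "|")):
        return False
    if LONG_URL.search(stripped):
        return False
    return stripped.isascii()


def filter_page(lines):
    """Keep lines that occur once on the page and pass both checks."""
    stripped = [line.strip() for line in lines]
    counts = Counter(stripped)
    kept = []
    for text in stripped:
        # Lines repeated on the same page are treated as navigation or footer.
        if not text or counts[text] > 1:
            continue
        if is_boilerplate(text) or not is_bpe_safe(text):
            continue
        kept.append(text)
    return kept
```

A real pipeline would likely need fuzzier matching than exact substring and exact-repeat tests, but the shape is the same: drop chrome first, then check that what remains is plain text.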
## Intended use
Use this dataset only as a tiny auxiliary micro-mix with official FineWeb SP8192. Do **not** replace FineWeb.
Recommended probes:
- 99.995% FineWeb / 0.005% V9
- 99.990% FineWeb / 0.010% V9
- 99.980% FineWeb / 0.020% V9
- 99.950% FineWeb / 0.050% V9
Reject any mixture that worsens seed42 FineWeb validation BPB.
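To make the probe rates concrete, the sketch below converts each recommended share into a token count for a given training budget. It assumes the mix is measured in tokens, which the card does not state; the function names and the example budget are hypothetical.

```python
# Recommended V9 shares from the list above, in parts per million
# (0.005% = 50 ppm). Integer ppm avoids floating-point rounding.
V9_SHARE_PPM = {"0p005pct": 50, "0p01pct": 100, "0p02pct": 200, "0p05pct": 500}


def v9_token_budget(total_tokens: int, share_ppm: int) -> int:
    """Number of tokens drawn from V9 for a given total training budget."""
    return total_tokens * share_ppm // 1_000_000


def mix_plan(total_tokens: int) -> dict:
    """Map each probe name to a (fineweb_tokens, v9_tokens) pair."""
    plan = {}
    for name, ppm in V9_SHARE_PPM.items():
        v9 = v9_token_budget(total_tokens, ppm)
        plan[name] = (total_tokens - v9, v9)
    return plan


if __name__ == "__main__":
    # Example budget of one billion tokens (an arbitrary illustrative figure).
    for name, (fineweb, v9) in mix_plan(1_000_000_000).items():
        print(f"{name}: FineWeb={fineweb:,} V9={v9:,}")
```

At these rates the V9 share is tiny in absolute terms: 50,000 tokens out of a billion at the lowest setting, 500,000 at the highest.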
## Critical rule
For training probes, use the plain text files:
- `plain_text/v9_micro_0p005pct.txt`
- `plain_text/v9_micro_0p01pct.txt`
- `plain_text/v9_micro_0p02pct.txt`
- `plain_text/v9_micro_0p05pct.txt`
- `plain_text/train.txt`
Do **not** train directly on raw JSONL. The JSONL files are for inspection, filtering, and reproducibility.
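For inspection, a JSONL file can be reduced to plain paragraphs along the following lines. The record schema is not described in this card, so the `text` field name is an assumption; check the actual records before relying on it. The shipped `plain_text/` files remain the ones to train on.

```python
import json


def jsonl_to_plain_text(jsonl: str, field: str = "text") -> str:
    """Extract one field from each JSONL record, one paragraph per line.

    `field` is a hypothetical key, not a documented part of the dataset schema.
    """
    paragraphs = []
    for raw in jsonl.splitlines():
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between records
        record = json.loads(raw)
        value = record.get(field)
        if isinstance(value, str) and value.strip():
            # Collapse internal whitespace so each record stays on one line.
            paragraphs.append(" ".join(value.split()))
    return "\n".join(paragraphs) + ("\n" if paragraphs else "")
```

Records without the chosen field are skipped rather than raising, which keeps the helper usable for a quick look at files whose schema is only partly known.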
## Pass condition
Reference seed42 baseline:
```text
1.08041364 BPB
```
Proceed to a 3-seed proof only if a seed42 probe scores below the baseline:
```text
1.08041364 BPB
```
Strong candidate threshold:
```text
< 1.08000000 BPB
```
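The thresholds above amount to a three-way decision, sketched below together with the usual definition of bits per byte (summed cross-entropy in nats, converted to bits, divided by the number of UTF-8 bytes scored). Whether the official evaluation computes BPB exactly this way is an assumption, and the function names and return labels are hypothetical.

```python
import math

BASELINE_BPB = 1.08041364   # reference seed42 baseline from this card
STRONG_BPB = 1.08000000     # strong candidate threshold from this card


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)


def classify_probe(seed42_bpb: float) -> str:
    """Apply the pass condition to a single seed42 validation result."""
    if seed42_bpb < STRONG_BPB:
        return "strong_candidate"
    if seed42_bpb < BASELINE_BPB:
        return "continue_to_3_seed"
    # A tie with the baseline does not pass: the rule requires strictly below.
    return "do_not_continue"
```

For example, a probe at 1.0804 BPB clears the baseline but not the strong threshold, so it would move to the 3-seed proof without being flagged as a strong candidate.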
## Dataset structure
```text
README.md
data/train.jsonl
data/validation.jsonl
data/test.jsonl
plain_text/train.txt
plain_text/validation.txt
plain_text/test.txt
plain_text/v9_micro_0p005pct.txt
plain_text/v9_micro_0p01pct.txt
plain_text/v9_micro_0p02pct.txt
plain_text/v9_micro_0p05pct.txt
scripts/build_v9_micro_mix.py
scripts/run_v9_seed42_probe.sh
docs/probe_plan.md
docs/dataset_design.md
source_sanitization.md
upload_to_hf.md
stats.json
dataset_infos.json
```
## What changed vs V8
V8 gave a weak positive signal at one setting, but not enough to justify a 3-seed proof. V9 is stricter:
- less Parameter-Golf meta text;
- fewer instruction-style examples;
- stronger filtering of JSON/chat artifacts;
- smaller recommended mix rates;
- more ordinary web-paragraph style;
- no claim that the dataset already beats SOTA.
## Safety and correctness
This dataset does not claim that font size reduces language-model token count. Font size can affect document rendering or DOCX metadata, but LM compression depends on the text/token payload. The correct compression target is removing repeated page chrome, styling noise, tracking strings, and low-value boilerplate while preserving the semantic payload.
## License
MIT
Provider:
8Planetterraforming



