8Planetterraforming/Parameter-Golf-V8-WebSignal-BPE-Entropy-MicroMix
Hugging Face · Updated 2026-04-20 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/8Planetterraforming/Parameter-Golf-V8-WebSignal-BPE-Entropy-MicroMix
Resource description:
---
license: mit
language:
- en
task_categories:
- text-generation
- text-classification
pretty_name: Parameter-Golf-V8-WebSignal-BPE-Entropy-MicroMix
tags:
- parameter-golf
- fineweb
- bpb
- micro-mix
- web-filtering
- compression
- bpe-aware
- auxiliary-training
size_categories:
- 1K<n<10K
---
# Parameter-Golf-V8-WebSignal-BPE-Entropy-MicroMix
## Short Description
Auxiliary micro-mix dataset for OpenAI Parameter Golf: FineWeb-style web-signal filtering, BPE-aware payload compression, boilerplate suppression, privacy-safe minimization, and exact compact-state reasoning for BPB-oriented probes.
## Extended Description
This is a **V8 auxiliary micro-mix dataset** for OpenAI Parameter Golf experiments. It is built from the strongest useful themes in the 8Planetterraforming dataset family:
- **V6 web/privacy filtering**: skip cookie banners, ads, navigation, repeated footers, irrelevant page chrome, and personal-data-heavy blocks unless they carry the actual signal.
- **V5 / V5x compression and compact-state exactness**: keep exact symbols, thresholds, paths, metrics, and short state updates while removing stale narrative scaffolding.
- **solutions-training-v4 style calibration**: preserve uncertainty, avoid hallucinated fields, and keep concise evidence rather than long explanations.
The dataset is designed for **Parameter Golf BPB probes**, not general chat finetuning.
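As an illustration of the V6-style web filtering theme, the following is a minimal sketch and not the pipeline used to build this dataset. The keyword pattern, the function name `filter_page_lines`, and the `max_repeat_fraction` threshold are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical keyword list for cookie banners, ads, and footer text.
BOILERPLATE_PATTERN = re.compile(
    r"cookie|accept all|advertisement|subscribe|all rights reserved",
    re.IGNORECASE,
)


def filter_page_lines(pages, max_repeat_fraction=0.5):
    """Drop keyword-matched lines and lines repeated across many pages.

    `pages` is a list of pages, each page a list of text lines.
    A line counts as repeated chrome (navigation, footer) when it appears
    on more than `max_repeat_fraction` of the pages. The repetition check
    is skipped when only one page is given.
    """
    page_count = Counter()
    for lines in pages:
        page_count.update({line.strip() for line in lines})

    limit = max_repeat_fraction * len(pages)
    cleaned = []
    for lines in pages:
        kept = []
        for line in lines:
            text = line.strip()
            if not text:
                continue
            if BOILERPLATE_PATTERN.search(text):
                continue
            if len(pages) > 1 and page_count[text] > limit:
                continue
            kept.append(text)
        cleaned.append(kept)
    return cleaned
```

A real filter would need a broader signal than a keyword list, and the card's caveat still applies: a block that looks like boilerplate should be kept when it carries the actual signal.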
## Why V8 Is Different From V7
V7 tested layout and font-size reasoning. That was useful as a concept check, but it was too instruction-like for FineWeb BPB and can shift the training distribution away from web text.
V8 is stricter:
1. It uses **plain web-like paragraphs** for the actual micro-mix.
2. It avoids raw JSON as training text.
3. It targets **web signal extraction**, not chat behavior.
4. It uses only tiny mix rates: **0.02%, 0.05%, 0.10%**.
5. It must be rejected immediately if seed42 BPB worsens.
## Intended Use
Use V8 only as a tiny auxiliary source on top of the official FineWeb SP8192 training stream.
Recommended probes:
- 99.98% FineWeb / 0.02% V8
- 99.95% FineWeb / 0.05% V8
- 99.90% FineWeb / 0.10% V8
Do **not** replace FineWeb with this dataset.
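To make the scale of these probes concrete, the sketch below splits a token budget between FineWeb and V8 at the three mix rates. The function name and the example budget are hypothetical; only the rates come from this card.

```python
# Mix rates from this card, expressed as parts per 10,000 so the
# arithmetic stays in integers (0.02% = 2 / 10,000, and so on).
MIX_RATES_PER_10K = {"0p02pct": 2, "0p05pct": 5, "0p10pct": 10}


def split_token_budget(total_tokens, rate_per_10k):
    """Return (fineweb_tokens, v8_tokens) for a total token budget."""
    v8_tokens = total_tokens * rate_per_10k // 10_000
    return total_tokens - v8_tokens, v8_tokens


if __name__ == "__main__":
    total = 1_000_000_000  # hypothetical budget, not an official figure
    for name, rate in MIX_RATES_PER_10K.items():
        fineweb, v8 = split_token_budget(total, rate)
        print(f"{name}: FineWeb={fineweb:,} V8={v8:,}")
```

Even at the largest rate, V8 contributes only one token in a thousand, which is why it can supplement FineWeb but not stand in for it.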
## Record-Oriented Guardrail
The current goal is not to make the model better at chat. The goal is to reduce **FineWeb validation BPB** under the official Parameter Golf constraints.
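For reference, bits per byte (BPB) normalizes the model's total negative log-likelihood by the UTF-8 byte count of the text rather than by the token count. The sketch below uses that standard definition; the function name and inputs are assumptions, and the official Parameter Golf scorer remains the authority on how BPB is computed.

```python
import math


def bits_per_byte(token_nll_nats, text):
    """Compute BPB from per-token negative log-likelihoods in nats.

    The summed NLL is converted from nats to bits (divide by ln 2) and
    then divided by the number of UTF-8 bytes in the scored text.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    total_bits = sum(token_nll_nats) / math.log(2)
    return total_bits / n_bytes
```

Because the denominator is bytes, a tokenizer change alters how the loss is spread over tokens without changing the byte count, so the byte accounting has to stay unambiguous for scores to be comparable.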
Reject any V8 mixture if:
- seed42 is worse than the current baseline,
- training exceeds the 10-minute 8xH100 budget,
- the artifact/code package exceeds 16 MB,
- validation/evaluation is modified,
- tokenizer scoring becomes unclear.
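The rejection criteria above can be sketched as a single check. The `ProbeResult` fields and function name are hypothetical, and reading "16 MB" as 16,000,000 bytes is an assumption; check the official rules for the exact limit.

```python
from dataclasses import dataclass

MAX_TRAIN_MINUTES = 10.0         # 10-minute 8xH100 budget from this card
MAX_ARTIFACT_BYTES = 16_000_000  # assumption: "16 MB" read as decimal megabytes


@dataclass
class ProbeResult:
    """Hypothetical summary of one seed42 probe run."""
    seed42_bpb: float
    baseline_bpb: float
    train_minutes: float
    artifact_bytes: int
    eval_modified: bool
    tokenizer_scoring_clear: bool


def rejection_reasons(result):
    """Return every guardrail the probe violates; empty means keep."""
    reasons = []
    if result.seed42_bpb > result.baseline_bpb:  # higher BPB is worse
        reasons.append("seed42 BPB worse than baseline")
    if result.train_minutes > MAX_TRAIN_MINUTES:
        reasons.append("training exceeds time budget")
    if result.artifact_bytes > MAX_ARTIFACT_BYTES:
        reasons.append("artifact/code package too large")
    if result.eval_modified:
        reasons.append("validation/evaluation was modified")
    if not result.tokenizer_scoring_clear:
        reasons.append("tokenizer scoring unclear")
    return reasons
```

Returning the full list of reasons, rather than stopping at the first failure, makes it easier to log why a mixture was dropped.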
## Files
```text
data/train.jsonl
data/validation.jsonl
data/test.jsonl
data/plain_text/train.txt
data/plain_text/validation.txt
data/plain_text/test.txt
data/plain_text/v8_micro_0p02pct.txt
data/plain_text/v8_micro_0p05pct.txt
data/plain_text/v8_micro_0p10pct.txt
scripts/build_v8_micro_mix.py
scripts/run_v8_seed42_probe.sh
docs/probe_plan.md
docs/dataset_design.md
stats.json
dataset_infos.json
source_sanitization.md
upload_to_hf.md
```
## Important Training Note
For Parameter Golf probes, prefer the files under:
```text
data/plain_text/
```
Do not train on raw JSON unless your pipeline explicitly strips metadata and converts each record to plain FineWeb-like text.
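If you do start from the JSONL files, a conversion step along these lines could strip the metadata first. This is a sketch only: the record schema is not documented in this card, so the `text` field name is an assumption to verify against the actual files.

```python
import json


def records_to_plain_text(jsonl_lines, text_field="text"):
    """Keep only the text payload of each JSONL record.

    `text_field` is a hypothetical field name. Blank lines and records
    without a non-empty string in that field are skipped, so no metadata
    keys or JSON syntax reach the training text.
    """
    paragraphs = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        text = record.get(text_field)
        if isinstance(text, str) and text.strip():
            paragraphs.append(text.strip())
    # Blank line between paragraphs, as in ordinary web text dumps.
    return "\n\n".join(paragraphs)
```

Usage, again assuming the `text` field exists:

```python
with open("data/train.jsonl", encoding="utf-8") as f:
    plain = records_to_plain_text(f)
```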
## Suggested Hugging Face Title
`Parameter-Golf-V8-WebSignal-BPE-Entropy-MicroMix`
## Suggested Hugging Face Summary
Auxiliary V8 micro-mix dataset for OpenAI Parameter Golf. Combines V6 web-signal filtering, V5/V5x compact-state exactness, and calibration-style concise evidence into FineWeb-like plain text for tiny 0.02–0.10% BPB probes.
Provider:
8Planetterraforming



