8Planetterraforming/Parameter-Golf-V8-WebSignal-BPE-Entropy-MicroMix

Name: 8Planetterraforming/Parameter-Golf-V8-WebSignal-BPE-Entropy-MicroMix
Creator: 8Planetterraforming
Published: 2026-04-20 07:13:42
License: No description provided

Hugging Face | Updated: 2026-04-20 | Indexed: 2026-04-26

Download link:

https://hf-mirror.com/datasets/8Planetterraforming/Parameter-Golf-V8-WebSignal-BPE-Entropy-MicroMix

Description:

---
license: mit
language:
- en
task_categories:
- text-generation
- text-classification
pretty_name: Parameter-Golf-V8-WebSignal-BPE-Entropy-MicroMix
tags:
- parameter-golf
- fineweb
- bpb
- micro-mix
- web-filtering
- compression
- bpe-aware
- auxiliary-training
size_categories:
- 1K<n<10K
---

# Parameter-Golf-V8-WebSignal-BPE-Entropy-MicroMix

## Short Description

Auxiliary micro-mix dataset for OpenAI Parameter Golf: FineWeb-style web-signal filtering, BPE-aware payload compression, boilerplate suppression, privacy-safe minimization, and exact compact-state reasoning for BPB-oriented probes.

## Extended Description

This is a **V8 auxiliary micro-mix dataset** for OpenAI Parameter Golf experiments.

It is built from the strongest useful themes in the 8Planetterraforming dataset family:

- **V6 web/privacy filtering**: skip cookie banners, ads, navigation, repeated footers, irrelevant page chrome, and personal-data-heavy blocks unless they carry the actual signal.
- **V5 / V5x compression and compact-state exactness**: keep exact symbols, thresholds, paths, metrics, and short state updates while removing stale narrative scaffolding.
- **solutions-training-v4 style calibration**: preserve uncertainty, avoid hallucinated fields, and keep concise evidence rather than long explanations.

The dataset is designed for **Parameter Golf BPB probes**, not general chat finetuning.

## Why V8 Is Different From V7

V7 tested layout and font-size reasoning. That was useful as a concept check, but it was too instruction-like for FineWeb BPB and can shift the training distribution away from web text.

V8 is stricter:

1. It uses **plain web-like paragraphs** for the actual micro-mix.
2. It avoids raw JSON as training text.
3. It targets **web signal extraction**, not chat behavior.
4. It uses only tiny mix rates: **0.02%, 0.05%, 0.10%**.
5. It must be rejected immediately if seed42 BPB worsens.

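To give a sense of how small these mix rates are, the sketch below splits a total training-token budget between the main stream and the auxiliary source. It is only an illustration: the function name `split_token_budget`, the basis-point representation, and the choice to measure the mix in tokens (rather than documents or bytes) are assumptions, not something this dataset card specifies, and the code is not the repository's build script.

```python
# Hypothetical helper: how many tokens of a fixed budget would come from the
# auxiliary source at a given mix rate. Rates are stored in basis points
# (1 bp = 0.01%) so the arithmetic stays in exact integers.
MIX_RATES_BP = {"0.02%": 2, "0.05%": 5, "0.10%": 10}


def split_token_budget(total_tokens: int, aux_bp: int) -> tuple[int, int]:
    """Return (main_tokens, aux_tokens) for a total budget and an aux rate in bp."""
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    if not 0 <= aux_bp <= 10_000:
        raise ValueError("aux_bp must be between 0 and 10000 basis points")
    aux_tokens = total_tokens * aux_bp // 10_000  # floor, never over-allocates aux
    return total_tokens - aux_tokens, aux_tokens


if __name__ == "__main__":
    budget = 1_000_000  # illustrative budget only, not a figure from the card
    for label, bp in MIX_RATES_BP.items():
        main_tokens, aux_tokens = split_token_budget(budget, bp)
        print(f"{label}: main={main_tokens} aux={aux_tokens}")
```

Flooring the auxiliary share means the auxiliary source can never exceed the stated rate, which seems consistent with the card's emphasis on keeping the mix tiny.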
## Intended Use

Use V8 only as a tiny auxiliary source on top of the official FineWeb SP8192 training stream.

Recommended probes:

- 99.98% FineWeb / 0.02% V8
- 99.95% FineWeb / 0.05% V8
- 99.90% FineWeb / 0.10% V8

Do **not** replace FineWeb with this dataset.

## Record-Oriented Guardrail

The current goal is not to make the model better at chat. The goal is to reduce **FineWeb validation BPB** under the official Parameter Golf constraints.

Reject any V8 mixture if:

- seed42 is worse than the current baseline,
- training exceeds the 10-minute 8xH100 budget,
- the artifact/code package exceeds 16 MB,
- validation/evaluation is modified,
- tokenizer scoring becomes unclear.

## Files

```text
data/train.jsonl
data/validation.jsonl
data/test.jsonl
data/plain_text/train.txt
data/plain_text/validation.txt
data/plain_text/test.txt
data/plain_text/v8_micro_0p02pct.txt
data/plain_text/v8_micro_0p05pct.txt
data/plain_text/v8_micro_0p10pct.txt
scripts/build_v8_micro_mix.py
scripts/run_v8_seed42_probe.sh
docs/probe_plan.md
docs/dataset_design.md
stats.json
dataset_infos.json
source_sanitization.md
upload_to_hf.md
```

## Important Training Note

For Parameter Golf probes, prefer the files under:

```text
data/plain_text/
```

Do not train on raw JSON unless your pipeline explicitly strips metadata and converts each record to plain FineWeb-like text.

## Suggested Hugging Face Title

`Parameter-Golf-V8-WebSignal-BPE-Entropy-MicroMix`

## Suggested Hugging Face Summary

Auxiliary V8 micro-mix dataset for OpenAI Parameter Golf. Combines V6 web-signal filtering, V5/V5x compact-state exactness, and calibration-style concise evidence into FineWeb-like plain text for tiny 0.02–0.10% BPB probes.
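The rejection rules in the Record-Oriented Guardrail section can be written down as a small check, sketched below. Everything named here (`ProbeResult`, its fields, `rejection_reasons`) is hypothetical and not part of the dataset's scripts. The sketch also assumes that "worse" means a strictly higher BPB than the baseline and that 16 MB means 16,000,000 bytes; adjust both if the official rules define them differently.

```python
from dataclasses import dataclass

MAX_TRAIN_SECONDS = 10 * 60          # 10-minute training budget
MAX_ARTIFACT_BYTES = 16 * 1_000_000  # assumption: decimal megabytes


@dataclass
class ProbeResult:
    """Hypothetical record of one seed42 probe run."""
    seed42_bpb: float              # validation BPB of the V8 mixture
    baseline_bpb: float            # validation BPB of the current baseline
    train_seconds: float           # wall-clock training time
    artifact_bytes: int            # size of the artifact/code package
    eval_modified: bool            # True if validation/evaluation was changed
    tokenizer_scoring_clear: bool  # False if tokenizer scoring is ambiguous


def rejection_reasons(result: ProbeResult) -> list[str]:
    """Return every guardrail the probe violates; an empty list means keep."""
    reasons = []
    if result.seed42_bpb > result.baseline_bpb:
        reasons.append("seed42 BPB is worse than the baseline")
    if result.train_seconds > MAX_TRAIN_SECONDS:
        reasons.append("training exceeds the 10-minute budget")
    if result.artifact_bytes > MAX_ARTIFACT_BYTES:
        reasons.append("artifact/code package exceeds 16 MB")
    if result.eval_modified:
        reasons.append("validation/evaluation was modified")
    if not result.tokenizer_scoring_clear:
        reasons.append("tokenizer scoring is unclear")
    return reasons
```

Returning the full list of reasons, rather than stopping at the first failure, makes it easier to log why a mixture was rejected.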

Provider:

8Planetterraforming
