
8Planetterraforming/Parameter-Golf-V9-FineWeb-Entropy-Selective-MicroMix

Source: Hugging Face. Updated 2026-04-20. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/8Planetterraforming/Parameter-Golf-V9-FineWeb-Entropy-Selective-MicroMix
Resource description:
---
license: mit
language:
- en
tags:
- parameter-golf
- fineweb
- bpb
- compression
- web-filtering
- entropy-reduction
- micro-mix
- bpe
size_categories:
- 1K<n<10K
pretty_name: Parameter Golf V9 FineWeb Entropy Selective MicroMix
---

# Parameter-Golf-V9-FineWeb-Entropy-Selective-MicroMix

## Short description

A tiny auxiliary micro-mix dataset for OpenAI Parameter Golf experiments, focused on FineWeb-like web signal extraction, boilerplate suppression, compact state continuity, BPE-safe plain text, and entropy-selective training probes.

## Extended description

`Parameter-Golf-V9-FineWeb-Entropy-Selective-MicroMix` is designed after the V7/V8 experiments showed that larger synthetic or instruction-heavy auxiliary data can worsen FineWeb BPB. V9 is intentionally smaller, cleaner, and more FineWeb-like.

The dataset combines the strongest usable ideas from the 8Planetterraforming dataset family:

1. **V6-style web signal filtering**
   Keep useful page content while suppressing cookie banners, repeated navigation, tracking text, newsletter prompts, advertisements, and low-value footer chrome.
2. **V5/V5x-style compact exactness**
   Preserve exact values, thresholds, names, dates, seeds, artifact limits, short state updates, and missing-evidence boundaries without adding invented fields.
3. **Presentation vs payload guardrails**
   Treat font size, layout, theme, and rendering metadata as presentation. Keep the semantic text payload compact and separate from visual styling.
4. **BPE-safe plain text**
   Avoid raw JSON, chat-format examples, markdown tables, long URLs, rare Unicode, and decorative symbols in the training substrate.
5. **Entropy-selective corpus design**
   Prefer short natural web paragraphs with common punctuation, stable units, and ordinary web vocabulary.

## Intended use

Use this dataset only as a tiny auxiliary micro-mix with official FineWeb SP8192. Do **not** replace FineWeb.
Recommended probes:

- 99.995% FineWeb / 0.005% V9
- 99.990% FineWeb / 0.010% V9
- 99.980% FineWeb / 0.020% V9
- 99.950% FineWeb / 0.050% V9

Reject any mixture that worsens seed42 FineWeb validation BPB.

## Critical rule

For training probes, use the plain text files:

- `plain_text/v9_micro_0p005pct.txt`
- `plain_text/v9_micro_0p01pct.txt`
- `plain_text/v9_micro_0p02pct.txt`
- `plain_text/v9_micro_0p05pct.txt`
- `plain_text/train.txt`

Do **not** train directly on raw JSONL. The JSONL files are for inspection, filtering, and reproducibility.

## Pass condition

Reference seed42 baseline:

```text
1.08041364 BPB
```

Continue to 3-seed proof only if a seed42 probe improves below:

```text
1.08041364 BPB
```

Strong candidate threshold:

```text
< 1.08000000 BPB
```

## Dataset structure

```text
README.md
data/train.jsonl
data/validation.jsonl
data/test.jsonl
plain_text/train.txt
plain_text/validation.txt
plain_text/test.txt
plain_text/v9_micro_0p005pct.txt
plain_text/v9_micro_0p01pct.txt
plain_text/v9_micro_0p02pct.txt
plain_text/v9_micro_0p05pct.txt
scripts/build_v9_micro_mix.py
scripts/run_v9_seed42_probe.sh
docs/probe_plan.md
docs/dataset_design.md
source_sanitization.md
upload_to_hf.md
stats.json
dataset_infos.json
```

## What changed vs V8

V8 gave a weak positive signal at one setting, but not enough to justify a 3-seed proof. V9 is stricter:

- less Parameter-Golf meta text;
- fewer instruction-style examples;
- stronger filtering of JSON/chat artifacts;
- smaller recommended mix rates;
- more ordinary web-paragraph style;
- no claim that the dataset already beats SOTA.

## Safety and correctness

This dataset does not claim that font size reduces language-model token count. Font size can affect document rendering or DOCX metadata, but LM compression depends on the text/token payload.

The correct compression target is removing repeated page chrome, styling noise, tracking strings, and low-value boilerplate while preserving the semantic payload.

## License

MIT
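
## Example: probe arithmetic and pass check (illustrative)

The recommended mix rates and the pass condition above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of `scripts/build_v9_micro_mix.py`: the function names `v9_token_budget` and `classify_probe`, the verdict strings, and the example training-token count are all assumptions made here. Only the mix rates, the plain text file names, and the two BPB thresholds come from the card.

```python
# Illustrative sketch only. Function names and verdict strings are hypothetical.
# The thresholds and mix rates are copied from the card.

BASELINE_BPB = 1.08041364   # reference seed42 baseline
STRONG_BPB = 1.08000000     # strong candidate threshold

# Mix rates in parts per million (0.005% == 50 ppm), keyed by plain text file.
PROBE_RATES_PPM = {
    "plain_text/v9_micro_0p005pct.txt": 50,
    "plain_text/v9_micro_0p01pct.txt": 100,
    "plain_text/v9_micro_0p02pct.txt": 200,
    "plain_text/v9_micro_0p05pct.txt": 500,
}


def v9_token_budget(total_tokens: int, rate_ppm: int) -> int:
    """Number of V9 tokens in a mix of total_tokens at the given rate."""
    # Integer arithmetic avoids floating-point rounding on tiny percentages.
    return total_tokens * rate_ppm // 1_000_000


def classify_probe(seed42_bpb: float) -> str:
    """Map a seed42 FineWeb validation BPB to a next step."""
    if seed42_bpb < STRONG_BPB:
        return "strong-candidate"
    if seed42_bpb < BASELINE_BPB:
        return "continue-to-3-seed"
    # Equal to or worse than the baseline: do not continue with this mixture.
    return "stop"


if __name__ == "__main__":
    total = 1_000_000_000  # assumed token budget, for illustration only
    for path, ppm in PROBE_RATES_PPM.items():
        print(path, v9_token_budget(total, ppm))
    print(classify_probe(1.0802))
```

With the assumed budget of one billion training tokens, the smallest probe draws 50,000 tokens from V9 and the largest 500,000, which shows how small the auxiliary share is meant to be.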
Provider:
8Planetterraforming