
8Planetterraforming/Parameter-Golf-V7-Resolution-Layout-Compression-Reasoning

Hugging Face | Updated 2026-04-19 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/8Planetterraforming/Parameter-Golf-V7-Resolution-Layout-Compression-Reasoning
Resource description:
---
language:
- en
license: mit
tags:
- parameter-golf
- auxiliary-training
- compression-reasoning
- image-resolution
- layout-filtering
- multimodal-consistency
- synthetic-data
task_categories:
- text-generation
- question-answering
size_categories:
- 1K<n<10K
pretty_name: Parameter-Golf-V7-Resolution-Layout-Compression-Reasoning
---

# Parameter-Golf-V7-Resolution-Layout-Compression-Reasoning

## Overview

This is a synthetic auxiliary dataset for OpenAI Parameter Golf style experiments. It is **not** a replacement for FineWeb. It is designed to test whether a tiny auxiliary mix can reduce wasted next-token probability around compression reasoning, web-page signal filtering, display-vs-payload confusion, and multimodal preview planning.

The central lesson is deliberately narrow:

> Smaller rendered objects can reduce image/video payload, but smaller font size does not reduce the underlying text tokens. For language-model BPB, the useful target is shorter cleaner content, less repeated boilerplate, better tokenizer fit, and fewer wrong continuations.

## Why V7 exists

V6/B7 focuses on privacy filtering and web signal extraction. V7 extends that idea into resolution/layout/compression reasoning:

- distinguish **presentation** from **payload**,
- distinguish **preview filtering** from **final generation**,
- preserve exact discrete labels in cube/alphabet/optics-style multimodal tasks,
- avoid overclaiming from tiny file-size differences,
- keep Parameter Golf aligned with FineWeb BPB.

## Relation to Parameter Golf

Leaderboard performance is evaluated on FineWeb validation BPB. Therefore V7 should only be used as a small auxiliary probe:

- start: 0.25% or 0.5% V7 mix,
- conservative test: 1.0% V7 mix,
- only try 2–3% if seed42 improves and 3-seed mean confirms it,
- never replace FineWeb with V7.
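The escalation rule in the list above can be sketched as a small gate. This is a minimal illustration, not part of the dataset's tooling: the function name `may_try_gated_mix`, the dict-of-seeds input shape, and the reading of "improves" as strictly lower BPB are all assumptions made for the example.

```python
from statistics import mean

# Mix fractions taken from the list above; the 2-3% tier is gated.
START_MIXES = (0.0025, 0.005)
CONSERVATIVE_MIX = 0.01
GATED_MIXES = (0.02, 0.03)


def may_try_gated_mix(baseline_bpb, candidate_bpb, seed=42):
    """Return True if the 2-3% tier may be tried (hypothetical helper).

    Both arguments map seed -> FineWeb validation BPB (lower is better).
    Assumed rule: the seed-42 run must improve AND the mean over at least
    three shared seeds must improve as well.
    """
    if set(baseline_bpb) != set(candidate_bpb):
        raise ValueError("baseline and candidate must use the same seeds")
    if len(baseline_bpb) < 3 or seed not in baseline_bpb:
        raise ValueError("need at least three seeds, including the gate seed")
    seed_improves = candidate_bpb[seed] < baseline_bpb[seed]
    mean_improves = mean(candidate_bpb.values()) < mean(baseline_bpb.values())
    return seed_improves and mean_improves
```

Requiring both conditions guards against a single lucky seed: a run that beats the baseline on seed 42 but loses on the 3-seed mean is rejected.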

## Schema

Each JSONL row has:

- `id`
- `task`
- `subcategory`
- `input`
- `target`
- `source_theme`
- `difficulty`
- `parameter_golf_role`
- `recommended_mix`

## Splits

See `stats.json` for exact counts and hashes.

## File layout

```text
data/train.jsonl
data/validation.jsonl
data/test.jsonl
stats.json
source_sanitization.md
upload_to_hf.md
convert_to_chat_format.py
parameter_golf_v7_probe_plan.md
```

## Important caveat

This dataset does **not** train image generation directly. It trains text behavior around planning, compression accounting, and signal extraction. Any improvement for image/video/3D workflows would be indirect through better instruction following and fewer wasteful generations.

# Parameter-Golf-V7-Bonus-Extended-Compression-Examples

Additional bonus dataset for Parameter Golf V7.

This file extends the V7 concept of:

- resolution-aware filtering,
- layout vs payload reasoning,
- DOCX/font-size observations,
- image preview budgeting,
- FineWeb-style noise removal.

---

## Technical Note (Important)

Font size can affect rendered documents (e.g. DOCX) due to style and layout metadata. However, it does **not directly reduce language-model token count**.

For Parameter Golf, the correct interpretation is:

- remove repeated templates, styling markup, tracking strings, and low-value page chrome;
- use low-resolution previews only as a filtering stage;
- preserve semantic payload and exact symbolic constraints;
- keep FineWeb as the primary corpus;
- use this dataset strictly as a small auxiliary probe.

---

## Recommended Mixing Ratios

Start conservatively:

- 99.75% main corpus / 0.25% V7 bonus
- 99.50% main corpus / 0.50% V7 bonus
- 99.00% main corpus / 1.00% V7 bonus

Reject any mixture that degrades FineWeb seed BPB.
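
As a rough sketch of what these ratios mean in practice, the following computes how many auxiliary rows give a target fraction of the combined set and builds a shuffled mix. The helper names (`aux_count_for_mix`, `build_mix`, `accept_mixture`) are hypothetical, and mixing at row level rather than token level is an assumption; a real training pipeline may need to balance by token count instead.

```python
import random


def aux_count_for_mix(n_main, aux_fraction):
    """Number of auxiliary rows so that they form aux_fraction of the total."""
    if not 0.0 <= aux_fraction < 1.0:
        raise ValueError("aux_fraction must be in [0, 1)")
    # aux / (main + aux) = f  =>  aux = main * f / (1 - f)
    return round(n_main * aux_fraction / (1.0 - aux_fraction))


def build_mix(main_rows, aux_rows, aux_fraction, seed=0):
    """Return main rows plus a random sample of aux rows, shuffled together."""
    main_rows, aux_rows = list(main_rows), list(aux_rows)
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    k = aux_count_for_mix(len(main_rows), aux_fraction)
    if k > len(aux_rows):
        raise ValueError("not enough auxiliary rows for this fraction")
    mixed = main_rows + rng.sample(aux_rows, k)
    rng.shuffle(mixed)
    return mixed


def accept_mixture(baseline_bpb, mixed_bpb):
    """Assumed rejection rule: keep a mixture only if BPB did not get worse."""
    return mixed_bpb <= baseline_bpb
```

For instance, `aux_count_for_mix(9900, 0.01)` gives 100 auxiliary rows, so the combined 10,000 rows are 99% main corpus and 1% V7 bonus.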

---

## Contents

- `v7_bonus_extended_examples.jsonl`: 930 examples
- `v7_bonus_stats.json`: counts and checksum
- `v7_bonus_probe_plan.md`: minimal evaluation plan

---

## Hugging Face Title

`Parameter-Golf-V7-Bonus-Extended-Compression-Examples`

---

## Short Description

V7 bonus dataset for Parameter Golf: resolution preview filtering, DOCX/font-size layout reasoning, payload vs presentation separation, FineWeb noise filtering, and BPB-safe compression heuristics.
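
The Contents list above mentions an example count and a checksum. A minimal sketch of checking a downloaded JSONL file locally might look like the following. The function name `count_and_hash` is hypothetical, and SHA-256 is an assumption: the card does not say which hash algorithm or field names `v7_bonus_stats.json` uses, so the returned values would need to be compared against that file by hand.

```python
import hashlib
from pathlib import Path


def count_and_hash(path):
    """Return (non-empty line count, SHA-256 hex digest) for a JSONL file.

    SHA-256 is assumed here; swap in whatever algorithm the stats file names.
    """
    data = Path(path).read_bytes()
    n_rows = sum(1 for line in data.splitlines() if line.strip())
    return n_rows, hashlib.sha256(data).hexdigest()
```

For the bonus file, the returned count would be expected to match the 930 examples listed above.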
Provided by:
8Planetterraforming