
8Planetterraforming/parameter_golf_v14_quecto_recovery_tower_english

Source: Hugging Face. Updated 2026-04-26; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/8Planetterraforming/parameter_golf_v14_quecto_recovery_tower_english
Resource overview:

English-only auxiliary dataset for compact-model and bits-per-byte (BPB) oriented experiments. V14 centers on the compression tower idea: store reusable structure rather than every byte as a flat sequence. It also adds the write-delete-recover distinction: training can distill structure into weights, codebooks, caches, residuals, or generators, but exact recovery is possible only when the required information is stored or can be regenerated by a reversible rule.

The dataset is synthetic and sanitized. It does not contain real personal data, real incident evidence, private files, or validation text.

Task families include compression_tower_v14, recoverable_state_v14, privacy_web_filtering_v14, alphabet_mapping_v14, si_prefix_v14, parameter_golf_bpb_v14, and prompt_memory_budget_v14. Together they cover compression, state recovery, privacy filtering, alphabet mapping, SI unit prefixes, parameter optimization, and memory management.

Suggested use is as a small auxiliary dataset for experiments, starting with a 0.1% to 0.5% token mix against the main corpus. It is not a replacement for the main corpus.

Important limit: this dataset does not prove that an arbitrary 1 TB of data can be losslessly compressed into 16 MB. At that ratio, lossless compression of arbitrary data is impossible. What the dataset teaches is the distinction between that impossible case and the representation of structured sources via compact rules, generators, codebooks, residuals, and checksums.
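The limit described above can be illustrated with a minimal sketch. It is not taken from the dataset; the `generate`, `describe`, and `recover` helpers, the seed value, and the decimal unit sizes (1 TB = 10^12 bytes, 16 MB = 16 × 10^6 bytes) are assumptions chosen for illustration. The sketch shows three things: a structured source can be recovered exactly from a tiny description plus a checksum, a general-purpose compressor gains nothing on random bytes, and a counting argument bounds how few 1 TB inputs could ever have a 16 MB lossless encoding.

```python
import hashlib
import os
import zlib


def generate(seed: int, n_bytes: int) -> bytes:
    """Hypothetical deterministic rule: expand a small seed into n_bytes."""
    out = bytearray()
    counter = 0
    while len(out) < n_bytes:
        out.extend(hashlib.sha256(f"{seed}:{counter}".encode()).digest())
        counter += 1
    return bytes(out[:n_bytes])


def describe(seed: int, n_bytes: int) -> dict:
    """Compact description: the rule's inputs plus a checksum of its output."""
    data = generate(seed, n_bytes)
    return {"seed": seed, "n_bytes": n_bytes,
            "sha256": hashlib.sha256(data).hexdigest()}


def recover(description: dict) -> bytes:
    """Regenerate the data from the description and verify the checksum."""
    data = generate(description["seed"], description["n_bytes"])
    if hashlib.sha256(data).hexdigest() != description["sha256"]:
        raise ValueError("checksum mismatch: description does not match data")
    return data


# Structured source: the stored rule is enough for exact recovery.
original = generate(seed=14, n_bytes=100_000)
description = describe(seed=14, n_bytes=100_000)
exact_match = recover(description) == original

# Arbitrary source: random bytes have no rule to store, and zlib cannot shrink them.
noise = os.urandom(100_000)
compressed = zlib.compress(noise, 9)

# Counting argument (decimal units assumed): there are fewer than 2**(bits_out + 1)
# files of at most 16 MB, but 2**bits_in distinct 1 TB files, so the fraction of
# 1 TB inputs that can get a unique short encoding is below 2**log2_fraction_upper.
bits_in = 8 * 10**12
bits_out = 8 * 16 * 10**6
log2_fraction_upper = bits_out + 1 - bits_in

print(exact_match, len(noise), len(compressed), log2_fraction_upper)
```

The generated bytes here look random to zlib as well, so the saving comes entirely from storing the rule, not from running a compressor over the output.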
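The suggested 0.1% to 0.5% token mix can be turned into a token budget with a short calculation. The sketch below assumes the percentage is the auxiliary share of the combined training mix (the listing does not say whether it is a share of the total or a ratio to the main corpus), and the function name and the 10-billion-token main corpus are hypothetical.

```python
def aux_tokens_for_mix(main_tokens: int, aux_fraction: float) -> int:
    """Auxiliary tokens needed so they form aux_fraction of the combined mix.

    Assumes aux_fraction = aux / (main + aux), which gives
    aux = main * aux_fraction / (1 - aux_fraction).
    """
    if main_tokens < 0:
        raise ValueError("main_tokens must be non-negative")
    if not 0.0 <= aux_fraction < 1.0:
        raise ValueError("aux_fraction must be in [0, 1)")
    return round(main_tokens * aux_fraction / (1.0 - aux_fraction))


# Hypothetical main corpus size, used only to show the calculation.
main_corpus_tokens = 10_000_000_000
low_budget = aux_tokens_for_mix(main_corpus_tokens, 0.001)   # 0.1% mix
high_budget = aux_tokens_for_mix(main_corpus_tokens, 0.005)  # 0.5% mix
print(low_budget, high_budget)
```

If the percentage is instead read as a ratio to the main corpus alone, the budget is simply `main_tokens * aux_fraction`; at these small fractions the two readings differ by well under one percent.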
Provider:
8Planetterraforming