
mxxsc/zip2zip-plus-mixture-partitioned

Source: Hugging Face. Updated 2026-04-20. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/mxxsc/zip2zip-plus-mixture-partitioned
Description:
---
license: other
task_categories:
- text-generation
language:
- en
- multilingual
tags:
- pretraining
- fineweb
- code
- math
- multilingual
pretty_name: Zip2Zip Plus Mixture Partitioned
size_categories:
- 100B<n<1T
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 907324781210
    num_examples: 190184417
  download_size: 475006167489
  dataset_size: 907324781210
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Zip2Zip Plus Mixture Partitioned

This dataset is a partitioned pretraining-data mixture built for zip2zip language-model pretraining.

The mixture is byte-balanced across four top-level domains:

| Domain | Source | Target byte ratio |
|---|---|---:|
| General | `HuggingFaceFW/fineweb-edu`, `sample-100BT` | 50% |
| Code | `bigcode/the-stack-dedup` | 20% |
| Math | `HuggingFaceTB/finemath`, `finemath-3plus` | 10% |
| Multilingual | `epfml/FineWeb2-HQ`, 20 language subsets | 20% |

The uploaded layout is partitioned by source domain. It is intended as the full source-partitioned version from which remixed training datasets and token-budget subsets can be derived.

## Dataset Structure

Each row has the schema:

```python
{
    "text": str,
    "source": str
}
```

The `source` field preserves source detail, for example:

- `fineweb-edu::sample-100BT`
- `the-stack-dedup`
- `finemath::finemath-3plus`
- `fineweb2-hq::deu_Latn`
- `fineweb2-hq::cmn_Hani`

## Multilingual Subsets

The multilingual portion uses 20 FineWeb2-HQ language subsets, sampled according to their row counts:

- `arb_Arab`
- `ces_Latn`
- `cmn_Hani`
- `dan_Latn`
- `deu_Latn`
- `ell_Grek`
- `fas_Arab`
- `fra_Latn`
- `hun_Latn`
- `ind_Latn`
- `ita_Latn`
- `jpn_Jpan`
- `nld_Latn`
- `pol_Latn`
- `por_Latn`
- `rus_Cyrl`
- `spa_Latn`
- `swe_Latn`
- `tur_Latn`
- `vie_Latn`

## Intended Use

This dataset is intended for research on language-model pretraining and zip2zip-style training pipelines.

Typical workflow:

1. Build or download this partitioned mixture.
2. Count tokens with the target tokenizer.
3. Remix into flat mixed shards.
4. Create token-budget subsets, such as a 20B-token subset.
5. Use the resulting flat mixed dataset for pretraining.

## Loading

```python
from datasets import load_dataset

ds = load_dataset("mxxsc/zip2zip-plus-mixture-partitioned", split="train")
print(ds[0])
```

## Data Construction

The mixture was constructed by streaming source datasets and writing incremental compressed JSONL shards. The top-level mixture ratios are byte-based, not token-based. The multilingual domain is internally allocated according to language subset row counts.

## Caveats

This dataset inherits the quality, filtering, licensing, and safety properties of its upstream datasets. Users should consult the original dataset cards before redistribution or downstream deployment.

The mixture ratios are based on written text bytes. Token ratios may differ depending on the tokenizer.

## Source Datasets

- `HuggingFaceFW/fineweb-edu`
- `bigcode/the-stack-dedup`
- `HuggingFaceTB/finemath`
- `epfml/FineWeb2-HQ`
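
The token-counting and token-budget steps of the typical workflow, together with the caveat that byte ratios and token ratios can differ, could be sketched roughly as below. This is an illustrative sketch and not the dataset authors' pipeline. The names `domain_of`, `shares`, `utf8_bytes`, `whitespace_tokens`, and `take_token_budget` are hypothetical, the prefix-to-domain mapping is inferred from the `source` examples and the domain table in the card, and the whitespace splitter is only a stand-in for the target tokenizer.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator

# Assumed mapping from the prefix of `source` to a top-level domain,
# inferred from the example `source` values in the dataset card.
DOMAIN_BY_PREFIX = {
    "fineweb-edu": "general",
    "the-stack-dedup": "code",
    "finemath": "math",
    "fineweb2-hq": "multilingual",
}


def domain_of(source: str) -> str:
    """Map e.g. 'fineweb2-hq::deu_Latn' to 'multilingual'."""
    prefix = source.split("::", 1)[0]
    return DOMAIN_BY_PREFIX[prefix]


def utf8_bytes(text: str) -> int:
    """Size of the text in UTF-8 bytes (the unit the mixture ratios use)."""
    return len(text.encode("utf-8"))


def whitespace_tokens(text: str) -> int:
    """Placeholder tokenizer: swap in the real target tokenizer's count."""
    return len(text.split())


def shares(rows: Iterable[dict], measure: Callable[[str], int]) -> Dict[str, float]:
    """Fraction of the total `measure` (bytes, tokens, ...) per domain."""
    totals: Counter = Counter()
    for row in rows:
        totals[domain_of(row["source"])] += measure(row["text"])
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {domain: n / grand_total for domain, n in totals.items()}


def take_token_budget(
    rows: Iterable[dict], budget: int, count_tokens: Callable[[str], int]
) -> Iterator[dict]:
    """Yield rows in order until the next row would exceed the token budget."""
    used = 0
    for row in rows:
        n = count_tokens(row["text"])
        if used + n > budget:
            break
        used += n
        yield row


# Tiny made-up rows, only to show the call pattern.
sample_rows = [
    {"text": "Water boils at 100 degrees.", "source": "fineweb-edu::sample-100BT"},
    {"text": "def f(x): return x + 1", "source": "the-stack-dedup"},
    {"text": "Größe und Maße", "source": "fineweb2-hq::deu_Latn"},
]
print(shares(sample_rows, utf8_bytes))         # per-domain byte shares
print(shares(sample_rows, whitespace_tokens))  # per-domain token shares
print(len(list(take_token_budget(sample_rows, 10, whitespace_tokens))))
```

Because the uploaded layout is partitioned by source domain, a prefix of the rows would likely come from a single domain, so `take_token_budget` only makes sense on rows that have already been remixed into a flat mixed order. For the real data, the rows could be read lazily with `load_dataset(..., split="train", streaming=True)` rather than downloading the full dataset first.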
Provider:
mxxsc