epfl-dlab/zip2zip-plus-mixture-full

Name: epfl-dlab/zip2zip-plus-mixture-full
Creator: epfl-dlab
Published: 2026-04-20 08:15:06
License: No description available

Hugging Face · Updated 2026-04-20 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/epfl-dlab/zip2zip-plus-mixture-full


Description:

---
license: other
task_categories:
- text-generation
language:
- en
- multilingual
tags:
- pretraining
- language-modeling
- mixture
- zip2zip
- fineweb
- code
- math
- multilingual
pretty_name: Zip2Zip Plus Mixture Full
size_categories:
- 100B<n<1T
---

# Zip2Zip Plus Mixture Full

This dataset is the full Zip2Zip Plus pretraining-data mixture for zip2zip language-model pretraining. It combines general web text, code, math, and multilingual web text with byte-based top-level mixture ratios.

| Domain | Source | Target byte ratio |
|---|---|---:|
| General | `HuggingFaceFW/fineweb-edu`, `sample-100BT` | 50% |
| Code | `bigcode/the-stack-dedup` | 20% |
| Math | `HuggingFaceTB/finemath`, `finemath-3plus` | 10% |
| Multilingual | `epfml/FineWeb2-HQ`, 20 language subsets | 20% |

## Dataset Size

Final token-count statistics:

| Metric | Value |
|---|---:|
| Tokens | 218.27B |
| Rows | 190,184,417 |
| Text bytes | 860.00 GiB |
| Counted shards | 1,731 |

Tokens were counted with the Llama 3.1 tokenizer:

- tokenizer: `meta-llama/Llama-3.1-8B`
- `add_special_tokens=False`

The local token-count output was written to:

```text
/iopsstor/scratch/cscs/mxx/zip2zip-pretraining-data/mixture-v1/token_counts.json
```

## Dataset Structure

Each row has the schema:

```python
{
    "text": str,
    "source": str
}
```

The `source` field preserves source detail, for example:

- `fineweb-edu::sample-100BT`
- `the-stack-dedup`
- `finemath::finemath-3plus`
- `fineweb2-hq::deu_Latn`
- `fineweb2-hq::cmn_Hani`

## Multilingual Subsets

The multilingual portion uses 20 FineWeb2-HQ language subsets:

- `arb_Arab`
- `ces_Latn`
- `cmn_Hani`
- `dan_Latn`
- `deu_Latn`
- `ell_Grek`
- `fas_Arab`
- `fra_Latn`
- `hun_Latn`
- `ind_Latn`
- `ita_Latn`
- `jpn_Jpan`
- `nld_Latn`
- `pol_Latn`
- `por_Latn`
- `rus_Cyrl`
- `spa_Latn`
- `swe_Latn`
- `tur_Latn`
- `vie_Latn`

## Loading

```python
from datasets import load_dataset

ds = load_dataset("epfl-dlab/zip2zip-plus-mixture-full", split="train")
print(ds[0])
```

## Intended Use

This dataset is intended for research on language-model pretraining and zip2zip-style training pipelines. It can be used directly for pretraining or as the parent dataset for creating smaller token-budget subsets, such as a 20B-token subset.

## Data Construction

The source datasets were streamed and written into incremental shards. The initial build was source-partitioned, and the final full dataset was remixed into flat mixed shards.

The top-level ratios are byte-based, not token-based. Token ratios may differ depending on the tokenizer.

## Caveats

This dataset inherits the quality, filtering, licensing, and safety properties of its upstream datasets. Users should consult the original dataset cards before redistribution or downstream deployment.

The dataset may contain noisy, duplicated, sensitive, or otherwise undesirable content inherited from large-scale web, code, math, and multilingual corpora.

## Source Datasets

- `HuggingFaceFW/fineweb-edu`
- `bigcode/the-stack-dedup`
- `HuggingFaceTB/finemath`
- `epfml/FineWeb2-HQ`
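Because the card states that the top-level ratios are byte-based and that each row carries a `source` string, the mixture can be checked on any sample of rows by summing text bytes per domain. The following is a minimal sketch of that check, not the authors' tooling. It assumes that "text bytes" means UTF-8 encoded length and that the part of `source` before `::` identifies the upstream dataset; the `DOMAIN_BY_PREFIX` mapping and the function names are hypothetical, inferred from the example `source` values in the card.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

# Hypothetical mapping from the `source` prefix to the card's top-level domains,
# inferred from the example values listed in the card.
DOMAIN_BY_PREFIX = {
    "fineweb-edu": "General",
    "the-stack-dedup": "Code",
    "finemath": "Math",
    "fineweb2-hq": "Multilingual",
}


def domain_of(source: str) -> str:
    """Map a `source` value such as 'fineweb2-hq::deu_Latn' to a domain name."""
    prefix = source.split("::", 1)[0]
    return DOMAIN_BY_PREFIX.get(prefix, "Unknown")


def byte_shares(rows: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Return each domain's share of total text bytes (assumed UTF-8) in `rows`."""
    totals: Counter = Counter()
    for row in rows:
        totals[domain_of(row["source"])] += len(row["text"].encode("utf-8"))
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {domain: count / grand_total for domain, count in totals.items()}


# Synthetic rows that follow the card's schema; these are not real dataset rows.
sample = [
    {"text": "a" * 50, "source": "fineweb-edu::sample-100BT"},
    {"text": "b" * 20, "source": "the-stack-dedup"},
    {"text": "c" * 10, "source": "finemath::finemath-3plus"},
    {"text": "d" * 20, "source": "fineweb2-hq::deu_Latn"},
]
shares = byte_shares(sample)
for domain, share in sorted(shares.items()):
    print(f"{domain}: {share:.0%}")
```

To run the same check on real data without downloading the full 860 GiB, one option is to pass `streaming=True` to `load_dataset` and feed a bounded slice of the resulting iterable (for example via `itertools.islice`) to `byte_shares`. A short prefix of the stream may not match the target ratios exactly, so the result is best read as a rough sanity check.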

Provided by:

epfl-dlab
