epfl-dlab/zip2zip-plus-mixture-20b

Name: epfl-dlab/zip2zip-plus-mixture-20b
Creator: epfl-dlab
Published: 2026-04-20 08:09:34
License: no description provided

Source: Hugging Face. Updated: 2026-04-20. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/epfl-dlab/zip2zip-plus-mixture-20b

Resource description:

---
license: other
task_categories:
- text-generation
language:
- en
- multilingual
tags:
- pretraining
- language-modeling
pretty_name: Zip2Zip Plus Mixture 20B
size_categories:
- 10B<n<100B
---

# Zip2Zip Plus Mixture 20B

This dataset is a 20B-token pretraining subset built for zip2zip language-model pretraining.

It is derived from the full Zip2Zip Plus mixture, which is byte-balanced across four top-level domains:

| Domain | Source | Target byte ratio |
|---|---|---:|
| General | `HuggingFaceFW/fineweb-edu`, `sample-100BT` | 50% |
| Code | `bigcode/the-stack-dedup` | 20% |
| Math | `HuggingFaceTB/finemath`, `finemath-3plus` | 10% |
| Multilingual | `epfml/FineWeb2-HQ`, 20 language subsets | 20% |

The subset was created from remixed flat shards using Llama 3.1 token counting.

## Dataset Size

Final subset statistics:

| Metric | Value |
|---|---:|
| Tokens | 20,000,000,613 |
| Rows | 18,808,438 |
| Text bytes | 81.15 GiB |

Tokens were counted with:

- `meta-llama/Llama-3.1-8B`
- `add_special_tokens=False`

The final token count is slightly above 20B because the subset is cut at row boundaries.

## Dataset Structure

Each row has the schema:

```python
{
    "text": str,
    "source": str
}
```

The `source` field preserves source detail, for example:

- `fineweb-edu::sample-100BT`
- `the-stack-dedup`
- `finemath::finemath-3plus`
- `fineweb2-hq::deu_Latn`
- `fineweb2-hq::cmn_Hani`

## Source Mixture

The parent full mixture uses byte-based top-level ratios:

- general: 50%
- code: 20%
- math: 10%
- multilingual: 20%

The multilingual portion uses 20 FineWeb2-HQ language subsets:

- `arb_Arab`
- `ces_Latn`
- `cmn_Hani`
- `dan_Latn`
- `deu_Latn`
- `ell_Grek`
- `fas_Arab`
- `fra_Latn`
- `hun_Latn`
- `ind_Latn`
- `ita_Latn`
- `jpn_Jpan`
- `nld_Latn`
- `pol_Latn`
- `por_Latn`
- `rus_Cyrl`
- `spa_Latn`
- `swe_Latn`
- `tur_Latn`
- `vie_Latn`

## Intended Use

This dataset is intended for research on language-model pretraining and zip2zip-style training pipelines.

Example loading:

```python
from datasets import load_dataset

ds = load_dataset("epfl-dlab/zip2zip-plus-mixture-20b", split="train")
print(ds[0])
```

## Data Construction

The full mixture was first built as a source-partitioned dataset, then remixed into flat mixed shards. This 20B-token subset was created by reading the remixed shards in order, tokenizing text with the Llama 3.1 tokenizer, and writing rows until the target token budget was reached.

The main mixture ratios are byte-based, not token-based. The final 20B subset is therefore expected to approximately preserve the parent mixture proportions, but exact token-level ratios may differ.

## Caveats

This dataset inherits the quality, filtering, licensing, and safety properties of its upstream datasets. Users should consult the original dataset cards before redistribution or downstream deployment.

The dataset may contain noisy, duplicated, sensitive, or otherwise undesirable content inherited from large-scale web, code, math, and multilingual corpora.

## Source Datasets

- `HuggingFaceFW/fineweb-edu`
- `bigcode/the-stack-dedup`
- `HuggingFaceTB/finemath`
- `epfml/FineWeb2-HQ`
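
## Illustrative Sketch: Row-Boundary Cut and Byte Shares

The card describes writing rows in order until the token budget is reached, and it notes that the mixture ratios are byte-based. The sketch below illustrates both ideas on toy data. It is not the authors' pipeline: the function names `take_until_budget`, `byte_share_by_source`, and `whitespace_count` are hypothetical, and the whitespace counter is only a stand-in for the Llama 3.1 tokenizer so that the example runs without downloads.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, str]  # matches the card's schema: {"text": str, "source": str}


def take_until_budget(
    rows: Iterable[Row],
    count_tokens: Callable[[str], int],
    budget: int,
) -> Iterator[Row]:
    """Yield rows in order until the running token total reaches `budget`.

    Rows are never split, so the final total can exceed the budget by
    part of the last row. This is why a 20B budget can end slightly above 20B.
    """
    total = 0
    for row in rows:
        if total >= budget:
            break
        total += count_tokens(row["text"])
        yield row


def byte_share_by_source(rows: Iterable[Row]) -> Dict[str, float]:
    """Return the fraction of UTF-8 text bytes per source family.

    The family is the part of `source` before '::' (for example
    'fineweb2-hq' for 'fineweb2-hq::deu_Latn').
    """
    sizes: Counter = Counter()
    for row in rows:
        family = row["source"].split("::", 1)[0]
        sizes[family] += len(row["text"].encode("utf-8"))
    total = sum(sizes.values())
    return {family: n / total for family, n in sizes.items()} if total else {}


def whitespace_count(text: str) -> int:
    """Toy token counter (assumption): stands in for the real tokenizer."""
    return len(text.split())


# Toy rows that reuse the `source` labels shown in the card.
toy_rows = [
    {"text": "a b c", "source": "fineweb-edu::sample-100BT"},
    {"text": "d e f g", "source": "the-stack-dedup"},
    {"text": "h i", "source": "fineweb2-hq::deu_Latn"},
    {"text": "j", "source": "finemath::finemath-3plus"},
]

subset = list(take_until_budget(toy_rows, whitespace_count, budget=5))
shares = byte_share_by_source(subset)
print(len(subset), shares)
```

With a budget of 5 toy tokens, the second row pushes the running total past the budget and is still written whole, which mirrors the overshoot reported in the size table. To approximate the card's counting on real rows, `count_tokens` could presumably be replaced by a function that loads `meta-llama/Llama-3.1-8B` with `transformers.AutoTokenizer` and returns the length of `input_ids` with `add_special_tokens=False`.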

Provider:

epfl-dlab
