epfl-dlab/zip2zip-plus-mixture-20b
Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/epfl-dlab/zip2zip-plus-mixture-20b
Resource description:
---
license: other
task_categories:
- text-generation
language:
- en
- multilingual
tags:
- pretraining
- language-modeling
pretty_name: Zip2Zip Plus Mixture 20B
size_categories:
- 10B<n<100B
---
# Zip2Zip Plus Mixture 20B
This dataset is a 20B-token subset built for zip2zip language-model pretraining.
It is derived from the full Zip2Zip Plus mixture, which is byte-balanced across four top-level domains:
| Domain | Source | Target byte ratio |
|---|---|---:|
| General | `HuggingFaceFW/fineweb-edu`, `sample-100BT` | 50% |
| Code | `bigcode/the-stack-dedup` | 20% |
| Math | `HuggingFaceTB/finemath`, `finemath-3plus` | 10% |
| Multilingual | `epfml/FineWeb2-HQ`, 20 language subsets | 20% |
The subset was cut from the remixed flat shards, with the token budget measured using the Llama 3.1 tokenizer.
## Dataset Size
Final subset statistics:
| Metric | Value |
|---|---:|
| Tokens | 20,000,000,613 |
| Rows | 18,808,438 |
| Text bytes | 81.15 GiB |
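As a rough sanity check, the three figures above imply some per-row and per-token averages. The sketch below only does arithmetic on the reported values. It assumes that GiB means 2**30 bytes and that 81.15 is a rounded figure, so the results are approximate.

```python
# Reported values from the statistics table.
TOKENS = 20_000_000_613
ROWS = 18_808_438
TEXT_BYTES = 81.15 * 2**30  # assumption: GiB = 2**30 bytes; 81.15 is rounded

tokens_per_row = TOKENS / ROWS
bytes_per_row = TEXT_BYTES / ROWS
bytes_per_token = TEXT_BYTES / TOKENS

print(f"tokens per row:  {tokens_per_row:,.0f}")
print(f"bytes per row:   {bytes_per_row:,.0f}")
print(f"bytes per token: {bytes_per_token:.2f}")
```

Under those assumptions an average row holds roughly 1,063 tokens, at about 4.36 bytes of text per token.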
Tokens were counted with:
- Tokenizer: `meta-llama/Llama-3.1-8B`
- Setting: `add_special_tokens=False`, so no special tokens (such as BOS) are added to the per-row counts
The final token count exceeds 20B by 613 tokens because the subset is cut at row boundaries: the last row is kept whole rather than truncated at the budget.
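The counting setup described above could be reproduced along the following lines. This is a minimal sketch, not the authors' script: `count_tokens` and `load_llama_tokenizer` are hypothetical helper names, and the batching is an illustrative choice.

```python
from typing import List


def count_tokens(tokenizer, texts: List[str]) -> int:
    """Sum token counts over `texts`, with no special tokens added."""
    encoded = tokenizer(texts, add_special_tokens=False)
    return sum(len(ids) for ids in encoded["input_ids"])


def load_llama_tokenizer():
    """Load the tokenizer named in the card (a gated model on the Hub)."""
    # Imported lazily so count_tokens can be used with any compatible tokenizer.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
```

For example, `count_tokens(load_llama_tokenizer(), ["Hello world"])` returns the number of Llama 3.1 tokens in that string, provided access to the gated model has been granted.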
## Dataset Structure
Each row has the schema:
```python
{
    "text": str,    # document text
    "source": str,  # upstream dataset and, where applicable, its subset
}
```
The `source` field records the upstream dataset and, where applicable, its subset, separated by `::`. For example:
- `fineweb-edu::sample-100BT`
- `the-stack-dedup`
- `finemath::finemath-3plus`
- `fineweb2-hq::deu_Latn`
- `fineweb2-hq::cmn_Hani`
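The `source` strings can be split back into a dataset name, an optional subset, and a top-level domain. The sketch below assumes every value follows one of the example forms above (`dataset` or `dataset::subset`). `parse_source`, `Source`, and `DOMAIN_BY_DATASET` are hypothetical names, and the domain mapping is read off the mixture table rather than taken from the dataset itself.

```python
from typing import NamedTuple, Optional

# Assumed mapping from the dataset prefix to the card's top-level domains.
DOMAIN_BY_DATASET = {
    "fineweb-edu": "general",
    "the-stack-dedup": "code",
    "finemath": "math",
    "fineweb2-hq": "multilingual",
}


class Source(NamedTuple):
    dataset: str
    subset: Optional[str]
    domain: str


def parse_source(source: str) -> Source:
    """Split 'dataset::subset' (or bare 'dataset') and look up its domain."""
    dataset, sep, subset = source.partition("::")
    domain = DOMAIN_BY_DATASET.get(dataset, "unknown")
    return Source(dataset, subset if sep else None, domain)


print(parse_source("fineweb2-hq::deu_Latn"))
print(parse_source("the-stack-dedup"))
```

Unrecognized prefixes map to `"unknown"` rather than raising, so a scan over the full dataset would surface any `source` form not listed here.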
## Source Mixture
The parent full mixture uses byte-based top-level ratios:
- general: 50%
- code: 20%
- math: 10%
- multilingual: 20%
The multilingual portion uses 20 FineWeb2-HQ language subsets:
- `arb_Arab`
- `ces_Latn`
- `cmn_Hani`
- `dan_Latn`
- `deu_Latn`
- `ell_Grek`
- `fas_Arab`
- `fra_Latn`
- `hun_Latn`
- `ind_Latn`
- `ita_Latn`
- `jpn_Jpan`
- `nld_Latn`
- `pol_Latn`
- `por_Latn`
- `rus_Cyrl`
- `spa_Latn`
- `swe_Latn`
- `tur_Latn`
- `vie_Latn`
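The subset names appear to follow a `language_Script` pattern (a three-letter language code plus a four-letter script code). As a small sketch under that assumption, the codes can be grouped by script, and rows can be matched to one language through the `source` field. `group_by_script` and `is_language` are hypothetical helpers, and the `fineweb2-hq::<code>` format is taken from the examples in this card.

```python
from collections import defaultdict

# The 20 FineWeb2-HQ subsets listed above.
LANGUAGES = [
    "arb_Arab", "ces_Latn", "cmn_Hani", "dan_Latn", "deu_Latn",
    "ell_Grek", "fas_Arab", "fra_Latn", "hun_Latn", "ind_Latn",
    "ita_Latn", "jpn_Jpan", "nld_Latn", "pol_Latn", "por_Latn",
    "rus_Cyrl", "spa_Latn", "swe_Latn", "tur_Latn", "vie_Latn",
]


def group_by_script(codes):
    """Map each script code to the language codes written in it."""
    groups = defaultdict(list)
    for code in codes:
        language, script = code.split("_")
        groups[script].append(language)
    return dict(groups)


def is_language(row, code):
    """True if the row comes from the given FineWeb2-HQ subset."""
    # Assumes the "fineweb2-hq::<code>" source format shown in the examples.
    return row["source"] == f"fineweb2-hq::{code}"


print(group_by_script(LANGUAGES))
```

With a loaded dataset `ds`, something like `ds.filter(lambda row: is_language(row, "deu_Latn"))` would then keep only the German rows.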
## Intended Use
This dataset is intended for research on language-model pretraining and zip2zip-style training pipelines.
Example loading:
```python
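from datasets import load_dataset

# Loads the full train split (81.15 GiB of text per the statistics above).
ds = load_dataset("epfl-dlab/zip2zip-plus-mixture-20b", split="train")
print(ds[0])  # a dict with "text" and "source" keys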
```
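Given the size of the dataset, it may be more convenient to inspect a few rows without downloading everything. The sketch below uses the `streaming=True` option of `load_dataset`, assuming the repository's files can be streamed. `peek` and `stream_train` are hypothetical helper names.

```python
from itertools import islice


def peek(rows, n=3):
    """Return the first n rows of any iterable without reading further."""
    return list(islice(rows, n))


def stream_train():
    """Open the train split as a lazily evaluated iterable dataset."""
    # Imported lazily so that peek() works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "epfl-dlab/zip2zip-plus-mixture-20b",
        split="train",
        streaming=True,
    )
```

Calling `peek(stream_train())` would then fetch only the first three rows over the network.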
## Data Construction
The full mixture was first built as a source-partitioned dataset, then remixed into flat mixed shards. This 20B-token subset was created by reading the remixed shards in order, tokenizing text with the Llama 3.1 tokenizer, and writing rows until the target token budget was reached.
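The budget-driven cut described above can be sketched as a generator that emits whole rows until the running token total reaches the budget. This is an illustrative reconstruction, not the authors' code: `take_until_budget` is a hypothetical name, and keeping the row that crosses the budget is inferred from the final count landing slightly above 20B.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, str]


def take_until_budget(
    rows: Iterable[Row],
    count_tokens: Callable[[str], int],
    budget: int,
) -> Iterator[Row]:
    """Yield rows in order until the cumulative token count reaches `budget`.

    Rows are never truncated, so the total may overshoot the budget by up to
    one row's worth of tokens.
    """
    total = 0
    for row in rows:
        if total >= budget:
            break
        yield row
        total += count_tokens(row["text"])


# Toy usage with whitespace splitting standing in for the real tokenizer.
toy_rows = [
    {"text": "a b c", "source": "fineweb-edu::sample-100BT"},
    {"text": "d e", "source": "the-stack-dedup"},
    {"text": "f g h i", "source": "finemath::finemath-3plus"},
]
kept = list(take_until_budget(toy_rows, lambda t: len(t.split()), budget=4))
print(len(kept))
```

In the toy run the budget of 4 is crossed by the second row, so two rows (5 tokens) are kept and the third is never emitted.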
The main mixture ratios are byte-based, not token-based. Because the subset is a prefix of the mixed shards, it should approximately preserve the parent mixture's byte proportions. Token-level ratios may differ, since the number of bytes per token varies across domains and languages.
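To see how byte shares and token shares can diverge, one could measure both over a sample of rows. The sketch below groups rows by the dataset prefix of `source` and accepts any size function. `shares` is a hypothetical helper, UTF-8 is an assumed encoding for the byte count, and whitespace splitting stands in for the real tokenizer.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def shares(
    rows: Iterable[Dict[str, str]],
    size_of: Callable[[str], int],
) -> Dict[str, float]:
    """Fraction of the total size contributed by each upstream dataset."""
    totals = Counter()
    for row in rows:
        dataset = row["source"].partition("::")[0]
        totals[dataset] += size_of(row["text"])
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: size / grand_total for name, size in totals.items()}


sample = [
    {"text": "aaaa bbbb", "source": "fineweb-edu::sample-100BT"},
    {"text": "x y z", "source": "the-stack-dedup"},
]
byte_shares = shares(sample, lambda t: len(t.encode("utf-8")))  # assumes UTF-8
token_shares = shares(sample, lambda t: len(t.split()))  # stand-in tokenizer
print(byte_shares)
print(token_shares)
```

In this toy sample the first source dominates by bytes but not by tokens, which is the kind of gap the paragraph above warns about.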
## Caveats
This dataset inherits the quality, filtering, licensing, and safety properties of its upstream datasets. Users should consult the original dataset cards before redistribution or downstream deployment.
The dataset may contain noisy, duplicated, sensitive, or otherwise undesirable content inherited from large-scale web, code, math, and multilingual corpora.
## Source Datasets
- `HuggingFaceFW/fineweb-edu`
- `bigcode/the-stack-dedup`
- `HuggingFaceTB/finemath`
- `epfml/FineWeb2-HQ`
Provider:
epfl-dlab



