mxxsc/zip2zip-plus-mixture-partitioned
Hugging Face | Updated 2026-04-20 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/mxxsc/zip2zip-plus-mixture-partitioned
Resource description:
---
license: other
task_categories:
- text-generation
language:
- en
- multilingual
tags:
- pretraining
- fineweb
- code
- math
- multilingual
pretty_name: Zip2Zip Plus Mixture Partitioned
size_categories:
- 100B<n<1T
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 907324781210
    num_examples: 190184417
  download_size: 475006167489
  dataset_size: 907324781210
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Zip2Zip Plus Mixture Partitioned
This dataset is a partitioned pretraining-data mixture built for zip2zip language-model pretraining.
The mixture targets fixed byte ratios across four top-level domains:
| Domain | Source | Target byte ratio |
|---|---|---:|
| General | `HuggingFaceFW/fineweb-edu`, `sample-100BT` | 50% |
| Code | `bigcode/the-stack-dedup` | 20% |
| Math | `HuggingFaceTB/finemath`, `finemath-3plus` | 10% |
| Multilingual | `epfml/FineWeb2-HQ`, 20 language subsets | 20% |
The uploaded layout is partitioned by source domain. It is intended as the full source-partitioned version from which remixed training datasets and token-budget subsets can be derived.
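As a rough illustration of what the target ratios imply, the sketch below splits the `dataset_size` value from the card metadata (907,324,781,210 bytes) by the four target ratios. The function name `domain_byte_budgets` is made up for this example, and treating `dataset_size` as the total of written text bytes is an assumption, so the actual per-domain sizes may differ.

```python
# Target byte ratios from the table above.
TARGET_RATIOS = {
    "general": 0.50,
    "code": 0.20,
    "math": 0.10,
    "multilingual": 0.20,
}

# `dataset_size` from the card metadata, in bytes.
# Assumption: this approximates the total written text bytes.
DATASET_SIZE_BYTES = 907_324_781_210


def domain_byte_budgets(total_bytes, ratios):
    """Split a total byte count across domains according to target ratios."""
    if abs(sum(ratios.values()) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    return {name: int(total_bytes * share) for name, share in ratios.items()}


budgets = domain_byte_budgets(DATASET_SIZE_BYTES, TARGET_RATIOS)
for name, n_bytes in budgets.items():
    print(f"{name:>12}: {n_bytes / 1e9:8.1f} GB")
```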
## Dataset Structure
Each row has the schema:
```python
{
"text": str,
"source": str
}
```
The `source` field preserves source detail, for example:
- `fineweb-edu::sample-100BT`
- `the-stack-dedup`
- `finemath::finemath-3plus`
- `fineweb2-hq::deu_Latn`
- `fineweb2-hq::cmn_Hani`
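Judging from these examples, a `source` value is a dataset name optionally followed by `::` and a subset name. A minimal sketch for splitting it, assuming that pattern holds for every row (the helper name `split_source` is hypothetical):

```python
def split_source(source):
    """Split a `source` value into (dataset, subset).

    The `::` separator is inferred from the examples in this card; values
    without it (such as `the-stack-dedup`) get `None` as the subset.
    """
    dataset, sep, subset = source.partition("::")
    return dataset, (subset if sep else None)


for value in ["fineweb-edu::sample-100BT", "the-stack-dedup", "fineweb2-hq::cmn_Hani"]:
    print(split_source(value))
```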
## Multilingual Subsets
The multilingual portion uses 20 FineWeb2-HQ language subsets, sampled according to their row counts:
- `arb_Arab`
- `ces_Latn`
- `cmn_Hani`
- `dan_Latn`
- `deu_Latn`
- `ell_Grek`
- `fas_Arab`
- `fra_Latn`
- `hun_Latn`
- `ind_Latn`
- `ita_Latn`
- `jpn_Jpan`
- `nld_Latn`
- `pol_Latn`
- `por_Latn`
- `rus_Cyrl`
- `spa_Latn`
- `swe_Latn`
- `tur_Latn`
- `vie_Latn`
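One way to read "sampled according to their row counts" is a proportional split of the multilingual byte budget. The sketch below shows that reading only; the proportional rule is an assumption, `allocate_by_row_count` is a hypothetical helper, and the row counts are invented for illustration rather than taken from FineWeb2-HQ.

```python
def allocate_by_row_count(total_bytes, row_counts):
    """Split a byte budget across subsets in proportion to their row counts."""
    total_rows = sum(row_counts.values())
    if total_rows <= 0:
        raise ValueError("row_counts must contain at least one positive count")
    # Integer arithmetic keeps the result exact; rounding is always downward.
    return {
        lang: total_bytes * rows // total_rows
        for lang, rows in row_counts.items()
    }


# Hypothetical row counts, for illustration only (not the real FineWeb2-HQ sizes).
example_rows = {"deu_Latn": 600, "fra_Latn": 300, "dan_Latn": 100}
allocation = allocate_by_row_count(1_000_000, example_rows)
print(allocation)
```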
## Intended Use
This dataset is intended for research on language-model pretraining and zip2zip-style training pipelines.
Typical workflow:
1. Build or download this partitioned mixture.
2. Count tokens with the target tokenizer.
3. Remix into flat mixed shards.
4. Create token-budget subsets, such as a 20B-token subset.
5. Use the resulting flat mixed dataset for pretraining.
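Step 4 can be sketched as taking rows in order until a token budget is reached. This is an illustration, not the pipeline used to build the dataset: `take_token_budget` and `whitespace_token_count` are hypothetical names, the whitespace counter stands in for the target tokenizer, and the rows are assumed to be already remixed (step 3) so that a prefix reflects the intended mixture.

```python
def take_token_budget(rows, count_tokens, budget):
    """Yield rows in order until adding the next one would exceed `budget` tokens."""
    used = 0
    for row in rows:
        n = count_tokens(row["text"])
        if used + n > budget:
            break
        used += n
        yield row


def whitespace_token_count(text):
    """Stand-in for a real tokenizer: counts whitespace-separated pieces."""
    return len(text.split())


toy_rows = [
    {"text": "a b c", "source": "fineweb-edu::sample-100BT"},
    {"text": "d e", "source": "the-stack-dedup"},
    {"text": "f g h i", "source": "finemath::finemath-3plus"},
]
subset = list(take_token_budget(toy_rows, whitespace_token_count, budget=5))
print(len(subset))
```

For a 20B-token subset, `budget` would be `20_000_000_000` and `count_tokens` would wrap the tokenizer used for pretraining.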
## Loading
```python
from datasets import load_dataset
ds = load_dataset("mxxsc/zip2zip-plus-mixture-partitioned", split="train")
print(ds[0])
```
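The card metadata lists a `download_size` of about 475 GB, so streaming may be more practical than a full download for a first look. The sketch below uses the `streaming=True` option of `load_dataset`; the helpers `open_stream` and `first_rows_from` are hypothetical, and the demo runs on toy rows so that nothing is fetched unless `open_stream()` is called.

```python
from itertools import islice


def first_rows_from(rows, source_prefix, limit):
    """Return up to `limit` rows whose `source` starts with `source_prefix`."""
    matching = (row for row in rows if row["source"].startswith(source_prefix))
    return list(islice(matching, limit))


def open_stream():
    """Open the train split lazily instead of downloading every shard."""
    # Imported here so the helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "mxxsc/zip2zip-plus-mixture-partitioned",
        split="train",
        streaming=True,
    )


# Toy rows with the same schema as the dataset, to show the helper offline.
toy_rows = [
    {"text": "print('hi')", "source": "the-stack-dedup"},
    {"text": "Guten Tag", "source": "fineweb2-hq::deu_Latn"},
    {"text": "Bonjour", "source": "fineweb2-hq::fra_Latn"},
]
preview = first_rows_from(toy_rows, "fineweb2-hq", limit=1)
print(preview)
```

With network access, `first_rows_from(open_stream(), "fineweb-edu", limit=3)` would preview real rows. Because the layout is partitioned by source domain, filtering a stream for a domain stored late in the shard order may scan a large part of the dataset first.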
## Data Construction
The mixture was constructed by streaming source datasets and writing incremental compressed JSONL shards. The top-level mixture ratios are byte-based, not token-based.
The multilingual domain is internally allocated according to language subset row counts.
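A minimal sketch of incremental shard writing follows. It is not the construction script: gzip as the compression format, the shard naming scheme, and the function name `write_jsonl_shards` are all assumptions made for illustration. Rows are consumed one at a time, and a new shard starts once the current one has reached a text-byte threshold.

```python
import gzip
import json
import os
import tempfile


def write_jsonl_shards(rows, out_dir, max_bytes_per_shard):
    """Write rows to gzip-compressed JSONL shards, rolling over by text bytes."""
    os.makedirs(out_dir, exist_ok=True)
    paths, handle, written = [], None, 0
    for row in rows:
        # Start a new shard at the beginning and whenever the threshold is hit.
        if handle is None or written >= max_bytes_per_shard:
            if handle is not None:
                handle.close()
            path = os.path.join(out_dir, f"shard-{len(paths):05d}.jsonl.gz")
            handle = gzip.open(path, "wt", encoding="utf-8")
            paths.append(path)
            written = 0
        handle.write(json.dumps(row, ensure_ascii=False) + "\n")
        # Count UTF-8 bytes of the text field, mirroring byte-based accounting.
        written += len(row["text"].encode("utf-8"))
    if handle is not None:
        handle.close()
    return paths


rows = [{"text": "x" * 10, "source": "the-stack-dedup"} for _ in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    shard_paths = write_jsonl_shards(rows, tmp, max_bytes_per_shard=20)
    shard_names = [os.path.basename(p) for p in shard_paths]
    with gzip.open(shard_paths[0], "rt", encoding="utf-8") as f:
        first_shard_rows = [json.loads(line) for line in f]
print(shard_names, len(first_shard_rows))
```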
## Caveats
This dataset inherits the quality, filtering, licensing, and safety properties of its upstream datasets. Users should consult the original dataset cards before redistribution or downstream deployment.
The mixture ratios are based on written text bytes. Token ratios may differ depending on the tokenizer.
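To see how byte ratios and token ratios can diverge, the sketch below converts byte shares into token shares using an average bytes-per-token figure for each domain. The bytes-per-token values are invented placeholders, not measurements; real values have to be measured with the target tokenizer. `token_shares` is a hypothetical helper.

```python
def token_shares(byte_shares, bytes_per_token):
    """Convert byte shares into token shares given average bytes per token."""
    tokens = {d: byte_shares[d] / bytes_per_token[d] for d in byte_shares}
    total = sum(tokens.values())
    return {d: t / total for d, t in tokens.items()}


byte_shares = {"general": 0.50, "code": 0.20, "math": 0.10, "multilingual": 0.20}
# Hypothetical averages; measure these with the target tokenizer.
bytes_per_token = {"general": 4.0, "code": 3.0, "math": 3.5, "multilingual": 2.5}

shares = token_shares(byte_shares, bytes_per_token)
for domain, share in shares.items():
    print(f"{domain:>12}: {share:.1%}")
```

Under these placeholder values, domains with fewer bytes per token take a larger share of the tokens than of the bytes.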
## Source Datasets
- `HuggingFaceFW/fineweb-edu`
- `bigcode/the-stack-dedup`
- `HuggingFaceTB/finemath`
- `epfml/FineWeb2-HQ`
Provided by:
mxxsc



