EER6/adlmc-stage1-10M
Source: Hugging Face. Updated 2026-04-19; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/EER6/adlmc-stage1-10M
Description:
# EER6/adlmc-stage1-10M
A 10M-row subset of the stage-1 pretraining mixture from [DreamCoder](https://github.com/DreamLM/Dream-Coder/tree/main/base), sampled according to the original mixture weights:
| Source | Weight |
|---|---|
| stack_v2_smol | 40% |
| dclm_filtered | 17% |
| open_coder_anneal | 15% |
| stack_edu_py | 15% |
| finemath | 5% |
| openmathinstruct | 2.5% |
| tinygsm | 2.5% |
| wikibook | 2% |
| tulu | 0.5% |
| natural_reasoning | 0.5% |
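As a rough sanity check on the table above, the weights can be turned into expected row counts per source. This is only a sketch: it assumes the subset holds exactly 10,000,000 rows, and the names `WEIGHTS_PCT`, `TOTAL_ROWS`, and `expected_rows` are illustrative, not taken from `dreamcoder/upload_stage1_subset.py`. Actual counts may differ slightly if the rows were drawn by random sampling.

```python
# Mixture weights in percent, copied from the table above.
WEIGHTS_PCT = {
    "stack_v2_smol": 40.0,
    "dclm_filtered": 17.0,
    "open_coder_anneal": 15.0,
    "stack_edu_py": 15.0,
    "finemath": 5.0,
    "openmathinstruct": 2.5,
    "tinygsm": 2.5,
    "wikibook": 2.0,
    "tulu": 0.5,
    "natural_reasoning": 0.5,
}

# Assumption: "10M" means exactly 10,000,000 rows.
TOTAL_ROWS = 10_000_000


def expected_rows(weights_pct, total_rows):
    """Map each source to the row count implied by its percentage weight."""
    return {name: round(total_rows * pct / 100) for name, pct in weights_pct.items()}


EXPECTED = expected_rows(WEIGHTS_PCT, TOTAL_ROWS)

for name, rows in EXPECTED.items():
    print(f"{name:20s} {rows:>10,d}")
```

The weights sum to 100%, so the expected counts add up to the full subset size.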
Each row has a `text` column (the example) and a `source` column (which mixture component it came from). Rows are globally shuffled.
Generated by `dreamcoder/upload_stage1_subset.py`.
## Usage
```python
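from datasets import load_dataset

# Stream the dataset so the 10M rows are not downloaded up front.
ds = load_dataset("EER6/adlmc-stage1-10M", split="train", streaming=True)

# Print the source and the first 100 characters of the first five rows.
for example in ds.take(5):
    print(example["source"], example["text"][:100])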
```
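Because every row carries a `source` column and the rows are globally shuffled, a prefix of the stream should roughly reflect the mixture weights. The sketch below tallies source shares over any iterable of row dicts. The helper name `source_shares` and the in-memory `sample` rows are hypothetical stand-ins, used so the example runs without network access.

```python
from collections import Counter
from itertools import islice


def source_shares(rows, limit=None):
    """Return {source: fraction} over the first `limit` rows (all rows if None)."""
    counts = Counter(row["source"] for row in islice(rows, limit))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


# Hypothetical rows with the same two columns as the dataset.
sample = [
    {"source": "stack_v2_smol", "text": "def add(a, b): return a + b"},
    {"source": "finemath", "text": "Solve 2x + 3 = 7."},
    {"source": "stack_v2_smol", "text": "print('hello')"},
    {"source": "tulu", "text": "User: Hi\nAssistant: Hello!"},
]

shares = source_shares(sample)
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {share:.1%}")
```

The same helper should accept the streamed dataset directly, for example `source_shares(ds, limit=10_000)`, since it only needs an iterable of dicts with a `source` key. Low-weight sources such as `tulu` (0.5%) need a fairly large prefix before their estimated share stabilizes.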
Provided by:
EER6



