mansaripo/ClimbMix_SYNTH_shuffled
Source: Hugging Face | Updated: 2026-02-26 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/mansaripo/ClimbMix_SYNTH_shuffled
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1132201222189
    num_examples: 631223639
  download_size: 1132201222189
  dataset_size: 1132201222189
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
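
The metadata above declares a single `default` config with one `train` split and a single string column named `text`. Since the declared download size is about 1.13 TB, reading the data in streaming mode is one reasonable way to inspect it. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed; the helper name `stream_first_rows` is hypothetical and not part of the dataset card.

```python
REPO_ID = "mansaripo/ClimbMix_SYNTH_shuffled"


def stream_first_rows(n=3):
    """Return the first n `text` values without downloading the full dataset."""
    # Imported lazily so that defining this helper does not require the library.
    from itertools import islice

    from datasets import load_dataset

    # streaming=True reads shards on demand instead of fetching everything up front.
    stream = load_dataset(REPO_ID, split="train", streaming=True)
    return [row["text"] for row in islice(stream, n)]
```

Calling `stream_first_rows(5)` needs network access to the Hub; only the `text` field is read because it is the only feature listed in the metadata.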
# ClimbMix + SYNTH Shuffled
A globally shuffled combination of
[gvlassis/ClimbMix](https://huggingface.co/datasets/gvlassis/ClimbMix) and
[PleIAs/SYNTH](https://huggingface.co/datasets/PleIAs/SYNTH),
interleaved by file-size weight.
| Source | Rows | Size | Weight |
|--------|------|------|--------|
| ClimbMix | 553,315,056 | ~955 GB | ~85.4% |
| SYNTH | 77,908,583 | ~164 GB | ~14.6% |
| **Total** | **631,223,639** | **~1,119 GB** | **100%** |
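
As a quick check of the table, the weights can be recomputed from the sizes. This is only a sketch using the rounded figures shown above, so the result agrees with the listed percentages to within roughly 0.1 percentage point; the exact byte counts behind the table are not given in the card.

```python
# Rounded sizes (GB) and exact row counts copied from the table.
sizes_gb = {"ClimbMix": 955, "SYNTH": 164}
rows = {"ClimbMix": 553_315_056, "SYNTH": 77_908_583}

total_gb = sum(sizes_gb.values())
total_rows = sum(rows.values())

# File-size weight of each source: its share of the combined size.
weights = {name: size / total_gb for name, size in sizes_gb.items()}

for name, weight in weights.items():
    print(f"{name}: {weight:.2%} of {total_gb} GB")
print(f"total rows: {total_rows:,}")
```

Note that the weights are shares of bytes, not of rows: ClimbMix contributes a larger fraction of the rows than of the bytes.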
Both sources were individually shuffled before interleaving.
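
The card does not say how the interleaving was implemented. The sketch below illustrates one plausible reading: shuffle each source on its own, then repeatedly draw the next row from a source chosen at random in proportion to its weight, until every source is exhausted. The function name, the seed, and the toy inputs are all assumptions made for illustration.

```python
import random


def shuffle_then_interleave(sources, weights, seed=0):
    """Shuffle each source, then interleave rows by weighted random choice."""
    rng = random.Random(seed)

    # Step 1: shuffle every source independently.
    pools = {}
    for name, rows in sources.items():
        pool = list(rows)
        rng.shuffle(pool)
        pools[name] = pool

    # Step 2: draw from the non-empty sources in proportion to their weights.
    mixed = []
    while any(pools.values()):
        active = [name for name in pools if pools[name]]
        chosen = rng.choices(active, weights=[weights[n] for n in active])[0]
        mixed.append(pools[chosen].pop())
    return mixed


# Toy example: 85 and 15 placeholder rows standing in for the two sources.
toy_sources = {
    "ClimbMix": [f"c{i}" for i in range(85)],
    "SYNTH": [f"s{i}" for i in range(15)],
}
toy_weights = {"ClimbMix": 0.854, "SYNTH": 0.146}
mixed = shuffle_then_interleave(toy_sources, toy_weights)
```

Every input row appears exactly once in `mixed`; only the order changes. At the scale of the real dataset this would be done over files or shards rather than in-memory lists.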
Provider:
mansaripo



