synth-1B
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/codelion/synth-1B
Resource description:
# synth-1B
A sequential sample of the first 999,997,890 tokens (estimated) from [PleIAs/SYNTH](https://huggingface.co/datasets/PleIAs/SYNTH).
## Dataset Details
- **Source**: PleIAs/SYNTH (500 parquet files, ~87B tokens total)
- **Sampling Method**: Sequential (first N documents)
- **Estimated Tokens**: 999,997,890
- **Documents**: 822,230
- **Token Estimation**: 4 characters ≈ 1 token
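The token count above is an estimate derived from character length, not the output of a tokenizer. A minimal sketch of that heuristic follows, assuming the count is the character length floor-divided by 4. The card does not say how rounding is handled, and `estimate_tokens` is a hypothetical helper name, not part of the dataset's tooling.

```python
CHARS_PER_TOKEN = 4  # heuristic stated in the card: 4 characters ≈ 1 token


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text` from its character length.

    Assumption: floor division; the card does not specify rounding.
    """
    return len(text) // CHARS_PER_TOKEN


# Figures reported in the card.
ESTIMATED_TOKENS = 999_997_890
DOCUMENTS = 822_230

# Average estimated tokens per document, derived from the two figures above.
avg_tokens_per_doc = ESTIMATED_TOKENS / DOCUMENTS
```

Dividing the reported token total by the document count gives roughly 1,216 estimated tokens per document.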
## Text Fields
Each document combines four fields from the original dataset:
- `query`: The question or prompt
- `query_seed_text`: Wikipedia or reference context
- `synthetic_reasoning`: Step-by-step reasoning trace
- `synthetic_answer`: Final answer
These four fields are joined with double newlines (`\n\n`) to form one training example per document.
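The concatenation described above can be sketched as follows. This assumes the fields are joined in the order listed and that missing or null fields become empty strings; the card confirms neither. `build_text` is a hypothetical helper, not the script used to build the dataset, and the example record holds placeholder values rather than real data.

```python
# Field names as listed in the card; the join order is an assumption.
FIELDS = ("query", "query_seed_text", "synthetic_reasoning", "synthetic_answer")


def build_text(record: dict) -> str:
    """Join the four SYNTH fields with double newlines.

    Assumptions: fields are joined in the listed order, and missing or
    None values are treated as empty strings.
    """
    return "\n\n".join(record.get(name) or "" for name in FIELDS)


# Placeholder record for illustration only (not taken from the dataset).
example = {
    "query": "q",
    "query_seed_text": "s",
    "synthetic_reasoning": "r",
    "synthetic_answer": "a",
}
text = build_text(example)
```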
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("codelion/synth-1B")
```
## License
Same as source dataset (PleIAs/SYNTH).
Provider:
maas
Created:
2025-11-12



