OpenBB/OpenBB-215K
Source: Hugging Face · Updated: 2026-02-26 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/OpenBB/OpenBB-215K
Resource description:
---
language:
- zh
- en
license: apache-2.0
task_categories:
- text-generation
- question-answering
size_categories:
- 100K<n<1M
tags:
- llm-training
- pretrain
- sft
- dpo
- reasoning
- chain-of-thought
- bilingual
pretty_name: OpenBB-215K
---
# OpenBB-215K v0.1
**A Bilingual Training Dataset for End-to-End LLM Training Pipeline Research**
> Curated by [OpenBB](https://huggingface.co/OpenBB) • Seed: 42 • License: Apache 2.0
---
## Overview
**OpenBB-215K** is a curated, reproducible training dataset designed to drive the complete four-phase LLM training pipeline — pretrain, SFT, alignment, and reasoning extension — in a single, self-contained package. It provides balanced Chinese–English coverage across ~215,000 samples drawn from 9 publicly available sources.
The dataset is intentionally compact (~1.2 GB) to enable full-pipeline experiments on a single GPU within hours, while preserving the diversity and quality characteristics of production-scale training data.
---
## Dataset Composition
| Phase | Source | Samples | Language |
|-------|--------|---------|----------|
| **Pretrain** | FineWeb-Edu-Chinese v2.1 | 50,000 | zh |
| | FineWeb-Edu sample-10BT | 50,000 | en |
| **SFT** | SmolTalk-Chinese | 10,978 | zh |
| | UltraChat-200K | 39,022 | en |
| **Align** | UltraFeedback Binarized | 50,000 | en |
| **Extend** | Opus 4.6 Reasoning (filtered) | 2,326 | en |
| | STEM-Reasoning-Complex | 5,000 | en+zh |
| | Gemini 3 Pro Reasoning | 5,000 | en |
| | Gemini 3.1 Pro Reasoning | 3,120 | en |
| **Total** | **9 sources** | **215,446** | **zh + en** |
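
The per-phase and overall totals can be checked directly from the rows above. The following is a small sketch in which the counts are copied by hand from the table; the dictionary layout and variable names are illustrative only.

```python
# Per-source sample counts copied from the composition table.
COUNTS = {
    "pretrain": [50_000, 50_000],
    "sft": [10_978, 39_022],
    "align": [50_000],
    "extend": [2_326, 5_000, 5_000, 3_120],
}

# Sum the sources within each phase, then across all phases.
phase_totals = {phase: sum(values) for phase, values in COUNTS.items()}
grand_total = sum(phase_totals.values())
num_sources = sum(len(values) for values in COUNTS.values())

print(phase_totals)
print(grand_total, num_sources)
```

The extend phase sums to 15,446 samples, which matches the "15.4K lines" noted for `extend.jsonl`.
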
---
## File Structure
```
openbb-215k/
├── README.md
├── pretrain.jsonl # 100K lines (50K zh + 50K en, shuffled)
├── sft.jsonl # 50K lines (11K zh + 39K en, shuffled)
├── align.jsonl # 50K lines (chosen/rejected preference pairs)
└── extend.jsonl # 15.4K lines (Chain-of-Thought reasoning)
```
---
## Data Format
### pretrain.jsonl — Pretraining corpus
```json
{"text": "A passage of educational text in Chinese or English..."}
```
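
Because each pretraining record carries only a `text` field and no language tag, checking the zh/en balance requires some form of language guess. The sketch below reads a JSONL file line by line and applies a rough heuristic based on the share of CJK characters. The helper names (`iter_jsonl`, `cjk_ratio`, `guess_language`, `language_counts`) and the 0.3 threshold are assumptions made for illustration, not part of the dataset tooling.

```python
import json
from collections import Counter
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def cjk_ratio(text: str) -> float:
    """Fraction of characters in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    if not text:
        return 0.0
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(text)


def guess_language(text: str, threshold: float = 0.3) -> str:
    """Hypothetical heuristic: label a passage 'zh' when enough of it is CJK."""
    return "zh" if cjk_ratio(text) >= threshold else "en"


def language_counts(path: str) -> Counter:
    """Count guessed languages over the 'text' field of every record."""
    return Counter(guess_language(record["text"]) for record in iter_jsonl(path))
```

Calling `language_counts("pretrain.jsonl")` on the real file should give roughly equal `zh` and `en` counts if the heuristic holds, though mixed-language passages may be labelled either way.
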
### sft.jsonl — Supervised fine-tuning conversations
```json
{"conversations": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```
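
A light structural check can catch malformed conversations before training. The sketch below assumes only what the example record shows: a `conversations` list whose items have string `role` and `content` fields. It does not assume any particular set of roles or turn order, and the function name `sft_problems` is hypothetical.

```python
from typing import List


def sft_problems(record: dict) -> List[str]:
    """Return human-readable problems; an empty list means the record looks well-formed."""
    messages = record.get("conversations")
    if not isinstance(messages, list) or not messages:
        return ["'conversations' must be a non-empty list"]

    problems: List[str] = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index} is not an object")
            continue
        # Each message needs both keys, and both values must be strings.
        for key in ("role", "content"):
            if not isinstance(message.get(key), str):
                problems.append(f"message {index} has no string '{key}'")
    return problems
```
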
### align.jsonl — Preference pairs for DPO/RLHF
```json
{"chosen": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}],
"rejected": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```
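
Many DPO implementations expect a shared prompt plus one chosen and one rejected response. The sketch below derives that shape from a record. It assumes that the last message in each list is the assistant response and that all earlier messages form an identical prompt in both lists. The function name `split_preference_pair` is hypothetical, and records that violate the assumption raise an error rather than being silently repaired.

```python
def split_preference_pair(record: dict) -> dict:
    """Split a chosen/rejected record into prompt turns and two response strings."""
    chosen, rejected = record["chosen"], record["rejected"]
    if not chosen or not rejected:
        raise ValueError("both 'chosen' and 'rejected' must contain messages")

    # Assumption: everything before the final message is the shared prompt.
    prompt = chosen[:-1]
    if rejected[:-1] != prompt:
        raise ValueError("chosen and rejected do not share the same prompt turns")

    return {
        "prompt": prompt,
        "chosen": chosen[-1]["content"],
        "rejected": rejected[-1]["content"],
    }
```
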
### extend.jsonl — Chain-of-Thought reasoning
```json
{"instruction": "Problem statement...", "thinking": "Step-by-step reasoning...", "output": "Final answer..."}
```
| Field | Required | Description |
|-------|:--------:|-------------|
| `instruction` | ✅ | Problem or task description |
| `thinking` | ✅ | Chain-of-Thought reasoning process |
| `output` | ✅ | Final answer or solution |
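
Since all three fields are required, a loader can reject incomplete records before rendering them into training text. The sketch below shows one way to do that. The helper names and the plain-text "Reasoning:" / "Answer:" template are assumptions for illustration; the dataset card does not prescribe a prompt template.

```python
from typing import List

REQUIRED_FIELDS = ("instruction", "thinking", "output")


def missing_fields(record: dict) -> List[str]:
    """List required fields that are absent, non-string, or blank."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if not isinstance(value, str) or not value.strip():
            missing.append(name)
    return missing


def to_training_text(record: dict) -> str:
    """Render a record with a hypothetical template: problem, reasoning, answer."""
    missing = missing_fields(record)
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return (
        f"{record['instruction']}\n\n"
        f"Reasoning:\n{record['thinking']}\n\n"
        f"Answer:\n{record['output']}"
    )
```
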
---
## Sources & Attribution
All data is sourced from publicly available datasets on Hugging Face Hub.
| Source | HuggingFace ID | License |
|--------|---------------|---------|
| FineWeb-Edu-Chinese v2.1 | [`opencsg/Fineweb-Edu-Chinese-V2.1`](https://huggingface.co/datasets/opencsg/Fineweb-Edu-Chinese-V2.1) | Apache 2.0 |
| FineWeb-Edu sample-10BT | [`HuggingFaceFW/fineweb-edu`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | ODC-By 1.0 |
| SmolTalk-Chinese | [`opencsg/Smoltalk-Chinese`](https://huggingface.co/datasets/opencsg/Smoltalk-Chinese) | Apache 2.0 |
| UltraChat-200K | [`HuggingFaceH4/ultrachat_200k`](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) | MIT |
| UltraFeedback Binarized | [`HuggingFaceH4/ultrafeedback_binarized`](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) | MIT |
| Opus 4.6 Reasoning | [`nohurry/Opus-4.6-Reasoning-3000x-filtered`](https://huggingface.co/datasets/nohurry/Opus-4.6-Reasoning-3000x-filtered) | Community |
| STEM-Reasoning-Complex | [`galaxyMindAiLabs/stem-reasoning-complex`](https://huggingface.co/datasets/galaxyMindAiLabs/stem-reasoning-complex) | Community |
| Gemini 3 Pro Reasoning | [`Roman1111111/gemini-3-pro-10000x-hard-high-reasoning`](https://huggingface.co/datasets/Roman1111111/gemini-3-pro-10000x-hard-high-reasoning) | Community |
| Gemini 3.1 Pro Reasoning | [`Roman1111111/gemini-3.1-pro-hard-high-reasoning`](https://huggingface.co/datasets/Roman1111111/gemini-3.1-pro-hard-high-reasoning) | Community |
> Extend-phase reasoning data is sourced from community-contributed synthetic datasets publicly available on Hugging Face Hub.
---
## Reproducibility
The dataset is fully reproducible from public sources:
```bash
python openbb/tools/build_openbb_215k.py
```
The build script seeds all sampling with `random.seed(42)` and streams data directly from Hugging Face Hub. As long as the upstream datasets are unchanged, running it on any machine should produce identical output.
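
The build script itself is not reproduced here. As a minimal sketch of how a fixed-size, seeded sample can be drawn from a stream without loading it fully into memory, the following uses reservoir sampling (Algorithm R). The function name `reservoir_sample` and the use of a local `random.Random(seed)` instance instead of the global `random.seed(42)` are assumptions for illustration and may differ from what the actual script does.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Draw k items uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)  # local generator, so other code cannot disturb it
    reservoir: List[T] = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir
```

With the same seed and the same input order, two calls return the same sample. If the upstream stream changes in content or order, the sample changes too.
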
---
## Quick Start
```bash
# Phase 1: Pretrain
bb train tier=nano stage=pretrain data_path=openbb/data/openbb-215k/pretrain.jsonl
# Phase 2: Supervised Fine-Tuning
bb train tier=nano stage=sft data_path=openbb/data/openbb-215k/sft.jsonl
# Phase 3: Preference Alignment (DPO)
bb train tier=nano stage=align method=dpo data_path=openbb/data/openbb-215k/align.jsonl
# Phase 4: Reasoning Extension (CoT)
bb train tier=nano stage=extend method=reason data_path=openbb/data/openbb-215k/extend.jsonl
```
---
## Citation
```bibtex
@misc{openbb215k,
title = {OpenBB-215K: A Bilingual Training Dataset for End-to-End LLM Pipeline Research},
author = {OpenBB Team},
year = {2026},
url = {https://huggingface.co/datasets/OpenBB/OpenBB-215K},
}
```
---
## License
This dataset is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). Individual source datasets retain their original licenses as listed in the attribution table above.
Provider:
OpenBB



