OpenBB/OpenBB-215K

Name: OpenBB/OpenBB-215K
Creator: OpenBB
Published: 2026-02-26 11:21:40
License: No description available

Hugging Face · Updated 2026-02-26 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/OpenBB/OpenBB-215K

Description:

---
language:
- zh
- en
license: apache-2.0
task_categories:
- text-generation
- question-answering
size_categories:
- 100K<n<1M
tags:
- llm-training
- pretrain
- sft
- dpo
- reasoning
- chain-of-thought
- bilingual
pretty_name: OpenBB-215K
---

# OpenBB-215K v0.1

**A Bilingual Training Dataset for End-to-End LLM Training Pipeline Research**

> Curated by [OpenBB](https://huggingface.co/OpenBB) • Seed: 42 • License: Apache 2.0

---

## Overview

**OpenBB-215K** is a curated, reproducible training dataset designed to drive the complete four-phase LLM training pipeline — pretrain, SFT, alignment, and reasoning extension — in a single, self-contained package. It provides balanced Chinese–English coverage across ~215,000 samples drawn from 9 publicly available sources.

The dataset is intentionally compact (~1.2 GB) to enable full-pipeline experiments on a single GPU within hours, while preserving the diversity and quality characteristics of production-scale training data.

---

## Dataset Composition

| Phase | Source | Samples | Language |
|-------|--------|---------|----------|
| **Pretrain** | FineWeb-Edu-Chinese v2.1 | 50,000 | zh |
| | FineWeb-Edu sample-10BT | 50,000 | en |
| **SFT** | SmolTalk-Chinese | 10,978 | zh |
| | UltraChat-200K | 39,022 | en |
| **Align** | UltraFeedback Binarized | 50,000 | en |
| **Extend** | Opus 4.6 Reasoning (filtered) | 2,326 | en |
| | STEM-Reasoning-Complex | 5,000 | en+zh |
| | Gemini 3 Pro Reasoning | 5,000 | en |
| | Gemini 3.1 Pro Reasoning | 3,120 | en |
| **Total** | **9 sources** | **~215,446** | **zh + en** |

---

## File Structure

```
openbb-215k/
├── README.md
├── pretrain.jsonl   # 100K lines (50K zh + 50K en, shuffled)
├── sft.jsonl        # 50K lines (11K zh + 39K en, shuffled)
├── align.jsonl      # 50K lines (chosen/rejected preference pairs)
└── extend.jsonl     # 15.4K lines (Chain-of-Thought reasoning)
```

---

## Data Format

### pretrain.jsonl — Pretraining corpus

```json
{"text": "A passage of educational text in Chinese or English..."}
```

### sft.jsonl — Supervised fine-tuning conversations

```json
{"conversations": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```

### align.jsonl — Preference pairs for DPO/RLHF

```json
{"chosen": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}], "rejected": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```

### extend.jsonl — Chain-of-Thought reasoning

```json
{"instruction": "Problem statement...", "thinking": "Step-by-step reasoning...", "output": "Final answer..."}
```

| Field | Required | Description |
|-------|:--------:|-------------|
| `instruction` | ✅ | Problem or task description |
| `thinking` | ✅ | Chain-of-Thought reasoning process |
| `output` | ✅ | Final answer or solution |

---

## Sources & Attribution

All data is sourced from publicly available datasets on HuggingFace Hub.

| Source | HuggingFace ID | License |
|--------|---------------|---------|
| FineWeb-Edu-Chinese v2.1 | [`opencsg/Fineweb-Edu-Chinese-V2.1`](https://huggingface.co/datasets/opencsg/Fineweb-Edu-Chinese-V2.1) | Apache 2.0 |
| FineWeb-Edu sample-10BT | [`HuggingFaceFW/fineweb-edu`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | ODC-By 1.0 |
| SmolTalk-Chinese | [`opencsg/Smoltalk-Chinese`](https://huggingface.co/datasets/opencsg/Smoltalk-Chinese) | Apache 2.0 |
| UltraChat-200K | [`HuggingFaceH4/ultrachat_200k`](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) | MIT |
| UltraFeedback Binarized | [`HuggingFaceH4/ultrafeedback_binarized`](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) | MIT |
| Opus 4.6 Reasoning | [`nohurry/Opus-4.6-Reasoning-3000x-filtered`](https://huggingface.co/datasets/nohurry/Opus-4.6-Reasoning-3000x-filtered) | Community |
| STEM-Reasoning-Complex | [`galaxyMindAiLabs/stem-reasoning-complex`](https://huggingface.co/datasets/galaxyMindAiLabs/stem-reasoning-complex) | Community |
| Gemini 3 Pro Reasoning | [`Roman1111111/gemini-3-pro-10000x-hard-high-reasoning`](https://huggingface.co/datasets/Roman1111111/gemini-3-pro-10000x-hard-high-reasoning) | Community |
| Gemini 3.1 Pro Reasoning | [`Roman1111111/gemini-3.1-pro-hard-high-reasoning`](https://huggingface.co/datasets/Roman1111111/gemini-3.1-pro-hard-high-reasoning) | Community |

> Extend phase reasoning data is sourced from community-contributed synthetic datasets publicly available on HuggingFace Hub.

---

## Reproducibility

The dataset is fully reproducible from public sources:

```bash
python openbb/tools/build_openbb_215k.py
```

All sampling uses `random.seed(42)` and streams data directly from HuggingFace Hub. Running the build script on any machine will produce identical output.

---

## Quick Start

```bash
# Phase 1: Pretrain
bb train tier=nano stage=pretrain data_path=openbb/data/openbb-215k/pretrain.jsonl

# Phase 2: Supervised Fine-Tuning
bb train tier=nano stage=sft data_path=openbb/data/openbb-215k/sft.jsonl

# Phase 3: Preference Alignment (DPO)
bb train tier=nano stage=align method=dpo data_path=openbb/data/openbb-215k/align.jsonl

# Phase 4: Reasoning Extension (CoT)
bb train tier=nano stage=extend method=reason data_path=openbb/data/openbb-215k/extend.jsonl
```

---

## Citation

```bibtex
@misc{openbb215k,
  title  = {OpenBB-215K: A Bilingual Training Dataset for End-to-End LLM Pipeline Research},
  author = {OpenBB Team},
  year   = {2026},
  url    = {https://huggingface.co/datasets/OpenBB/OpenBB-215K},
}
```

---

## License

This dataset is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). Individual source datasets retain their original licenses as listed in the attribution table above.
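
---

## Example: Checking Records Against the Data Format

The four JSONL schemas listed under Data Format can be checked before training with a short script. The following is a minimal sketch, not part of the OpenBB tooling: the names `REQUIRED_KEYS`, `is_message_list`, `validate_record`, and `validate_lines` are hypothetical helpers introduced here, and the required keys are assumed to be exactly the ones shown in the format examples above.

```python
import json

# Required top-level keys per file, as shown in the Data Format section.
# (Assumption: no other keys are mandatory.)
REQUIRED_KEYS = {
    "pretrain": {"text"},
    "sft": {"conversations"},
    "align": {"chosen", "rejected"},
    "extend": {"instruction", "thinking", "output"},
}

# Fields that hold a list of {"role": ..., "content": ...} messages.
MESSAGE_FIELDS = ("conversations", "chosen", "rejected")


def is_message_list(value):
    """Return True if value is a non-empty list of role/content dicts."""
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(
            isinstance(m, dict) and {"role", "content"} <= set(m)
            for m in value
        )
    )


def validate_record(stage, record):
    """Return a list of problems found in one parsed JSONL record."""
    problems = []
    missing = REQUIRED_KEYS[stage] - set(record)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for key in MESSAGE_FIELDS:
        if key in REQUIRED_KEYS[stage] and key in record:
            if not is_message_list(record[key]):
                problems.append(f"{key} is not a list of role/content messages")
    return problems


def validate_lines(stage, lines):
    """Yield (line_number, problems) for each invalid line in an iterable."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            yield number, [f"invalid JSON: {exc.msg}"]
            continue
        if not isinstance(record, dict):
            yield number, ["record is not a JSON object"]
            continue
        problems = validate_record(stage, record)
        if problems:
            yield number, problems
```

To check a file, pass an open file handle as `lines`, for example `validate_lines("extend", open("extend.jsonl", encoding="utf-8"))`, and print whatever it yields. Because the function is a generator that reads one line at a time, it does not need to load the whole file into memory.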

Provider:

OpenBB
