YouAIData/stem-reasoning-v1.0.0-ccbysa-002
Hugging Face · Updated 2026-04-20 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/YouAIData/stem-reasoning-v1.0.0-ccbysa-002
Resource description:
---
license: cc-by-sa-4.0
language:
- en
tags:
- stem
- reasoning
- chain-of-thought
- mathematics
- physics
- science
- instruction-tuning
- dpo
- rlvr
- sharegpt
- alpaca
- chatml
- preference
- youai
size_categories:
- 1K<n<10K
task_categories:
- text-generation
- question-answering
pretty_name: "YouAI Data — stem-reasoning-v1.0.0-ccbysa-002"
---
# YouAI Data — stem-reasoning-v1.0.0-ccbysa-002
## Dataset Description
**YouAI Data: 1,000 STEM training examples** extracted from verified CC-BY-SA expert sources, in which real domain experts solve real problems, not synthetic LLM generation. The set contains 239 step-by-step reasoning chains and 723 instruction/response pairs drawn from 443 unique sources. Every example traces to a source URL, the available source metadata, and a verified license. A free companion dataset adds 476 DPO preference pairs. 517 rows are raw reasoning examples with explicit answer labels suitable for RLVR; the remaining rows are exported in chat/SFT formats only.
## Dataset Summary
- **Size:** 1,000 examples
- **Domain:** STEM
- **Formats:** ShareGPT, Alpaca, ChatML, DPO Pairs, Raw Reasoning (RLVR-ready rows with explicit answer labels) — JSONL, with matching Parquet for non-empty shards
- **License:** CC-BY-SA-4.0
- **Created on:** 2026-04-11 19:47:08 UTC
## Quality Metrics
- **Mean quality score:** 4.39 / 5.0
- **Min quality score:** 4.3
- **Max quality score:** 4.46
- **Verification:** YouAI multi-stage verification on every example
## Diversity Metrics
- **Unique sources:** 443
- **Unique concepts tagged:** 3,234
- **Difficulty:** introductory: 390 (39%) · intermediate: 314 (31%) · advanced: 296 (30%)
- **No single source exceeds 15% of the dataset**
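The 15% cap on any single source can be checked with a simple frequency count. The sketch below is illustrative only: `max_source_share` is a hypothetical helper, the data is a toy list, and it assumes you have already collected one source identifier per example (for instance from the sidecar metadata).

```python
from collections import Counter


def max_source_share(source_ids):
    """Return (source_id, share) for the most frequent source in the list."""
    counts = Counter(source_ids)
    source, count = counts.most_common(1)[0]
    return source, count / len(source_ids)


# Hypothetical toy data: 20 examples, the largest source contributes 3 of them.
toy_sources = ["src-a"] * 3 + ["src-b"] * 2 + [f"src-{i}" for i in range(15)]
top_source, share = max_source_share(toy_sources)
within_cap = share <= 0.15  # 3 / 20 = 0.15, so the cap is just met
```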
## What's Included
- **5 export variants:** ShareGPT · Alpaca · ChatML · DPO Pairs · Raw Reasoning
- **DPO preference pairs:** 476 pairs; best-scoring = chosen, lowest-scoring = rejected
- **Raw reasoning rows:** 517 rows with explicit answer labels suitable for RLVR
- **Curriculum splits:** `train-introductory`, `train-intermediate`, `train-advanced`
- **Sidecar provenance:** Full metadata per example in `sidecar/dataclaw_meta.jsonl`
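The "best-scoring = chosen, lowest-scoring = rejected" rule for DPO pairs can be sketched as follows. This is a minimal illustration, not the YouAI pipeline: the function name, the `(text, score)` input shape, and the `prompt` / `chosen` / `rejected` keys are assumptions that follow common DPO conventions and may not match the exported schema.

```python
def build_preference_pair(prompt, candidates):
    """Build one preference pair from scored candidate responses.

    candidates is a list of (response_text, score) tuples.
    Returns None when no preference can be expressed.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: c[1])
    rejected, chosen = ranked[0], ranked[-1]
    if chosen[1] == rejected[1]:
        return None  # identical scores carry no preference signal
    return {"prompt": prompt, "chosen": chosen[0], "rejected": rejected[0]}


# Hypothetical scored answers to a single question.
pair = build_preference_pair(
    "What is the derivative of x^2?",
    [
        ("2x, by the power rule.", 4.46),
        ("x", 4.30),
        ("2x", 4.39),
    ],
)
```

Skipping ties is a design choice of this sketch; a pair whose two sides score the same gives the optimizer nothing to prefer.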
## Verification & Quality Assurance
Every example in this dataset passes a multi-stage review process developed by YouAI:
1. **Automated filtering** — Raw content is scored and filtered for information density, coherence, and domain relevance. Typically 30–40% of raw material is discarded at this stage.
2. **Multi-answer validation** — Multiple independent answers are generated per question. Examples are only retained when answers reach a minimum agreement threshold; conflicting answers are discarded.
3. **Computational verification** — Where applicable (mathematics, code), answers are verified programmatically against ground truth.
4. **YouAI expert review** — A structured human expert review pass is applied to sampled examples across every domain, checking for factual accuracy, logical soundness, and format correctness.
5. **4-signal quality scoring** — Each example carries a composite quality score derived from multiple independent signals. Only examples meeting the minimum threshold are included.
The specific methods, tooling, and thresholds used in our verification pipeline are proprietary to YouAI.
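Since the actual thresholds are not published, the following is only a generic sketch of what an agreement filter like step 2 might look like. The function name, the normalization, and the 0.6 threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def consensus_answer(answers, min_agreement=0.6):
    """Return the majority answer if enough candidates agree, else None.

    min_agreement is a hypothetical threshold, not a documented value.
    """
    normalized = [a.strip().lower() for a in answers]
    if not normalized:
        return None
    answer, votes = Counter(normalized).most_common(1)[0]
    if votes / len(normalized) >= min_agreement:
        return answer
    return None


kept = consensus_answer(["42", "42 ", "41"])   # 2 of 3 agree, example retained
dropped = consensus_answer(["1", "2", "3"])    # no majority, example discarded
```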
## Decontamination
Decontaminated against GSM8K, MATH, MMLU, HumanEval, MBPP, ARC-Challenge.
Method: 13-gram overlap detection. Zero benchmark test-set overlap.
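For readers unfamiliar with n-gram decontamination, here is a minimal sketch of 13-gram overlap detection. The tokenization (lowercased word characters) and the helper names are assumptions; the card does not specify how the check is implemented.

```python
import re


def ngrams(text, n=13):
    """Return the set of word n-grams in text (empty if fewer than n tokens)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts, n=13):
    """Collect every n-gram that appears in any benchmark test item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(example_text, index, n=13):
    """True if the example shares at least one n-gram with the benchmark index."""
    return not ngrams(example_text, n).isdisjoint(index)


# Toy benchmark item of 14 single-letter tokens.
index = build_benchmark_index(["a b c d e f g h i j k l m n"])
leaked = is_contaminated("prefix a b c d e f g h i j k l m suffix", index)
clean = is_contaminated("a b c d e f g h i j k l x", index)
```

An example is dropped as soon as a single 13-token window matches, which is why a long window is used: shorter windows would flag common phrases.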
## Source & Provenance
Every example traces to a specific source with a full audit trail:
- Source URL, title, channel, and license — in `provenance.json` (per-source summary with timestamp range)
- Per-example source metadata — in `sidecar/dataclaw_meta.jsonl` under `source.*`
- Available timestamp range (`source.timestamp_start`, `source.timestamp_end` in seconds) — in `sidecar/dataclaw_meta.jsonl`
- Quality score breakdown per signal (`quality.breakdown`) — in `sidecar/dataclaw_meta.jsonl`
All source material is verified as commercially licensable prior to inclusion. No sources with NonCommercial (NC) or NoDerivatives (ND) restrictions are used.
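A sidecar file in JSONL form can be read one record per line. The sketch below assumes that the dotted names above map to nested JSON objects (`source` and `quality`) and that the source URL is stored under `source.url`; both are assumptions, so inspect a real record before relying on them. `summarize_sidecar` is a hypothetical helper.

```python
import json


def summarize_sidecar(lines):
    """Parse JSONL lines and pull out provenance fields per example."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        source = record.get("source", {})
        start = source.get("timestamp_start")
        end = source.get("timestamp_end")
        has_range = start is not None and end is not None
        rows.append({
            "url": source.get("url"),  # assumed key name
            "duration_s": end - start if has_range else None,
            "breakdown": record.get("quality", {}).get("breakdown"),
        })
    return rows


# Hypothetical record shaped after the field names listed above.
sample_line = json.dumps({
    "source": {"url": "https://example.org/lecture",
               "timestamp_start": 120.0, "timestamp_end": 165.5},
    "quality": {"breakdown": {"signal_a": 4.4}},
})
rows = summarize_sidecar([sample_line, ""])
```

With the real file, an open handle can be passed directly, for example `summarize_sidecar(open("sidecar/dataclaw_meta.jsonl", encoding="utf-8"))`.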
## Data Collection
YouAI Data uses a proprietary autonomous pipeline for discovery, extraction, and quality assurance. The specific discovery methods, sources, and tooling are not disclosed. What we can share:
- All content originates from openly licensed or public domain sources
- Every example is traceable to its original source URL and license
- The pipeline is designed for domain-expert content — university lectures, technical documentation, professional proceedings — not synthetic generation
## How to Use
```python
from datasets import load_dataset

REPO = "YouAIData/stem-reasoning-v1.0.0-ccbysa-002"

# Load the default config (ShareGPT format, train split)
ds = load_dataset(REPO, split="train")

# Load a specific export format. With data_files alone, load_dataset returns a
# DatasetDict, so split="train" is passed to get a Dataset directly.
ds_alpaca = load_dataset(REPO, data_files="data/formats/alpaca.parquet", split="train")

# Load DPO preference pairs (when present)
ds_dpo = load_dataset(REPO, data_files="data/dpo/train-preferences.parquet", split="train")

# Load one curriculum split
ds_hard = load_dataset(REPO, data_files="data/curriculum/train-advanced.parquet", split="train")
```
## Limitations & Biases
English only (v1). Content skews toward university-level formal register. Speaker demographics reflect availability of open-license expert sources.
## Example
A real row sampled from this dataset (ShareGPT format):
```json
{
"conversations": [
{
"from": "human",
"value": "Given a disjoint-set data structure using union by size, where each node stores its size (number of descendants including itself), and the Union operation merges two trees by making the root with more descendants the parent of the root with fewer descendants, calculate the maximum height of the tree after performing n-1 union operations on n initially separate elements."
},
{
"from": "gpt",
"value": "**Step 1:** The problem involves a disjoint-set data structure initialized with n separate elements, each as a singleton tree.\n\n**Step 2:** The Union operation merges two trees by making the root with more descendants the parent of the root with fewer descendants, which is known as union by size.\n\n**Step 3:** When performing n-1 union operations on n elements, the resulting structure is a single tree.\n\n**Step 4:** The height of the tree is determined by the sequence of unions and the union by size strategy.\n\n**Step 5:** The worst-case scenario for height occurs when the tree is built in a way that maximizes depth, but union by size prevents this by always attaching the smaller tree to the larger one.\n\n**Step 6:** The maximum height of a tree built using union by size is logarithmic in the number of elements, specifically O(log n).\n\n**Step 7:** This is because each time a node is added to a tree, it is added to a tree at least as large as itself, so the size of the tree at least doubles with each level of depth.\n\n\n**Answer:** O(log n)"
}
]
}
```
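A row in this shape can be converted to role/content chat messages, and the final answer can be pulled out for answer-checking setups such as RLVR. The sketch below is one possible approach: the role mapping follows the usual ShareGPT convention, and the `**Answer:**` marker is inferred from the single sample row above, so other rows may need a different pattern.

```python
import re

# Usual ShareGPT speaker names mapped to chat roles (assumed convention).
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

# Captures everything after the first "**Answer:**" marker, minus outer whitespace.
ANSWER_RE = re.compile(r"\*\*Answer:\*\*\s*(.+?)\s*$", re.DOTALL)


def sharegpt_to_messages(row):
    """Convert a ShareGPT row to a list of {"role", "content"} messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in row["conversations"]
    ]


def extract_answer(text):
    """Return the text after the '**Answer:**' marker, or None if absent."""
    match = ANSWER_RE.search(text)
    return match.group(1) if match else None


# Shortened stand-in for the row shown above.
row = {
    "conversations": [
        {"from": "human", "value": "What is the maximum height?"},
        {"from": "gpt", "value": "**Step 1:** Reason.\n\n\n**Answer:** O(log n)"},
    ]
}
messages = sharegpt_to_messages(row)
answer = extract_answer(messages[-1]["content"])
```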
## Citation
```bibtex
@misc{youaidata_stem_reasoning_v1.0.0_ccbysa_002,
title={{stem-reasoning-v1.0.0-ccbysa-002}},
author={{YouAI Data}},
year={2026},
url={https://huggingface.co/datasets/YouAIData/stem-reasoning-v1.0.0-ccbysa-002}
}
```
## Training Results — Demonstrated Improvement
We fine-tuned **Qwen3.5-4B (Instruct)** with LoRA on our STEM 001 10K dataset (~10K examples + ~4K DPO pairs) to measure the impact on reasoning benchmarks. Accuracy before and after training:

| Benchmark | Baseline | + SFT | + SFT + DPO | Improvement (percentage points) |
|-----------|----------|-------|-------------|---------------------------------|
| **GSM8K** | 70.5% | 86.0% | **87.5%** | **+17.0** |
| **MATH-500** | 28.0% | 50.0% | **54.0%** | **+26.0** |
| **ARC-Challenge** | 88.5% | 87.5% | **89.5%** | **+1.0** |

**Training configuration:**
- **Method:** LoRA (r=32, alpha=64) — base model weights frozen
- **SFT:** 1 epoch, lr=5e-5, batch=2, grad_accum=8, packing enabled
- **DPO:** 200 steps, lr=2e-5, beta=0.1, implicit PEFT reference model
- **Hardware:** Single RTX 4090 (24GB), ~2h SFT + ~2.5h DPO
- **Data:** ~10K SFT examples (ShareGPT format) + ~4K DPO preference pairs from our STEM Reasoning v1 dataset
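As a quick sanity check on the SFT settings, the effective batch size follows from the per-device batch and gradient accumulation. The step count below is only a rough estimate under the assumption of exactly 10,000 unpacked examples; with packing enabled, the number of training sequences (and therefore steps) will differ.

```python
import math

per_device_batch = 2
grad_accum = 8
num_examples = 10_000  # assumption: the card only says "~10K"

# Sequences consumed per optimizer step on a single GPU.
effective_batch = per_device_batch * grad_accum

# Optimizer steps for one epoch if every example were its own sequence.
steps_without_packing = math.ceil(num_examples / effective_batch)
```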
Key takeaways:
- **MATH-500 nearly doubled** (28.0% to 54.0%, +26 points): our multi-step reasoning chains transfer directly to competition-level math
- **GSM8K rose 17 points** (70.5% to 87.5%): grade-school math improved markedly with real expert reasoning examples
- **ARC-Challenge held steady** (88.5% to 89.5%): no catastrophic forgetting on science reasoning, and DPO recovered the slight dip seen after SFT (87.5%)
- **DPO adds value on top of SFT:** all three benchmarks improved from SFT to SFT+DPO
These results suggest that a single, carefully curated 10K-example dataset of real expert reasoning can be competitive with what many achieve using 100K+ synthetic examples.
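The Improvement column in the table above is an absolute difference in accuracy (percentage points), which is not the same as a relative gain. The short sketch below recomputes both from the Baseline and + SFT + DPO columns; the `gains` helper is introduced here purely for illustration.

```python
# (baseline accuracy, accuracy after SFT + DPO), in percent, from the table.
results = {
    "GSM8K": (70.5, 87.5),
    "MATH-500": (28.0, 54.0),
    "ARC-Challenge": (88.5, 89.5),
}


def gains(baseline, final):
    """Return (absolute gain in points, relative gain in percent)."""
    points = final - baseline
    relative = 100 * points / baseline
    return round(points, 1), round(relative, 1)


summary = {name: gains(*scores) for name, scores in results.items()}
# GSM8K: +17.0 points (24.1% relative)
# MATH-500: +26.0 points (92.9% relative, i.e. nearly doubled)
# ARC-Challenge: +1.0 points (1.1% relative)
```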
## Commercial Licensing & Exclusive Datasets
Interested in purchasing our full 10K+ CC-BY datasets with exclusive licensing rights? We offer:
- **Standard datasets** — Full verified datasets in 5 formats (ShareGPT, Alpaca, ChatML, DPO, Raw Reasoning rows with explicit answer labels)
- **Exclusive licensing** — 90-day exclusivity on specific domain datasets
- **Custom datasets** — Tailored to your domain, format, and volume requirements
**Contact us:**
- **HuggingFace:** [YouAIData](https://huggingface.co/YouAIData)
- **X / Twitter:** [@YouAIDATA](https://x.com/YouAIDATA)
Provider:
YouAIData



