DJLougen/drift-preview-5k
Hugging Face | Updated 2026-04-20 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/DJLougen/drift-preview-5k
Resource description:
---
license: cc-by-4.0
tags:
- reasoning
- chain-of-thought
- curriculum-learning
- preview
- drift-diffusion
language:
- en
size_categories:
- 1K<n<10K
---
# Drift Preview 5K
## Overview
**Drift Preview 5K** is a public preview dataset of 5,000 high-quality reasoning samples, curated using evidence accumulation analysis and multi-factor quality scoring.
This dataset is released under CC-BY-4.0 for research and educational purposes. The full proprietary dataset adds complete annotations (per-step loss weights, drift-diffusion model (DDM) trajectories, premium composite scores); contact the author for commercial licensing.
### Dataset at a Glance
| Property | Value |
|----------|-------|
| **Total Samples** | 5,000 |
| **Format** | JSON Lines |
| **License** | CC-BY 4.0 |
| **Mean Signal Score** | 79.5 |
| **Mean Difficulty** | 0.487 |
---
## Quality Tier Distribution
| Tier | Count | Percentage |
|------|-------|------------|
| Elite | 1,805 | 36.1% |
| Premium | 357 | 7.1% |
| Professional | 1,214 | 24.3% |
| Standard | 1,624 | 32.5% |
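
The percentages follow directly from the counts over the 5,000-sample total. As a quick check, the sketch below recomputes them from the numbers in the table; the variable names are illustrative and not part of the dataset.

```python
# Tier counts copied from the table above.
tier_counts = {
    "Elite": 1805,
    "Premium": 357,
    "Professional": 1214,
    "Standard": 1624,
}

total = sum(tier_counts.values())

# Share of each tier, rounded to one decimal place as in the table.
tier_percentages = {
    tier: round(100 * count / total, 1) for tier, count in tier_counts.items()
}

for tier, pct in tier_percentages.items():
    print(f"{tier}: {tier_counts[tier]} ({pct}%)")
```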
---
## Curriculum Bin Distribution
| Bin | Count | Description |
|-----|-------|-------------|
| High Quality | 152 | Strong evidence trajectory |
| Usable | 4,565 | Solid reasoning samples |
| Borderline | 283 | Near-boundary quality |
---
## Source Distribution
| Source | Count | Domain |
|--------|-------|--------|
| OpenMathInstruct2 | 2,392 | Mathematics |
| OpenCode | 1,970 | Programming |
| MagPie Pro | 612 | Conversational |
| SCoRe | 26 | Cognitive Science |
---
## Sample Structure
```json
{
  "id": "sample_001",
  "conversations": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ],
  "problem": "...",
  "solution": "...",
  "difficulty": 0.65,
  "signal_score": 72.5,
  "source": "openmathinstruct2",
  "quality_tier": "elite",
  "curriculum_bin": "high_quality",
  "has_reasoning_trajectory": true
}
```
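
Since the data ships as JSON Lines, each row can be parsed and sanity-checked on its own. The following is a minimal sketch that infers field names and types from the sample record above. `EXPECTED_FIELDS` and `parse_record` are hypothetical helpers, and treating every field as required is an assumption rather than a documented guarantee.

```python
import json

# Field names and types inferred from the sample record above; treating all
# of them as required is an assumption of this sketch.
EXPECTED_FIELDS = {
    "id": str,
    "conversations": list,
    "problem": str,
    "solution": str,
    "difficulty": float,
    "signal_score": float,
    "source": str,
    "quality_tier": str,
    "curriculum_bin": str,
    "has_reasoning_trajectory": bool,
}


def parse_record(line: str) -> dict:
    """Parse one JSON Lines row and check it against EXPECTED_FIELDS."""
    record = json.loads(line)
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            raise ValueError(f"missing field: {name}")
        value = record[name]
        # JSON may encode a whole-number score as an int, so accept ints
        # (but not booleans) where a float is expected.
        if expected_type is float and isinstance(value, int) and not isinstance(value, bool):
            continue
        if not isinstance(value, expected_type):
            raise ValueError(f"field {name!r} should be {expected_type.__name__}")
    return record


# One row built from the sample record shown above.
example_line = json.dumps({
    "id": "sample_001",
    "conversations": [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
    ],
    "problem": "...",
    "solution": "...",
    "difficulty": 0.65,
    "signal_score": 72.5,
    "source": "openmathinstruct2",
    "quality_tier": "elite",
    "curriculum_bin": "high_quality",
    "has_reasoning_trajectory": True,
})

record = parse_record(example_line)
print(record["quality_tier"], record["difficulty"])
```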
---
## Usage
### Loading the Dataset
```python
from datasets import load_dataset
# Load the preview dataset
ds = load_dataset("DJLougen/drift-preview-5k", split="train")
```
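
Once loaded, the `quality_tier` and `difficulty` fields can drive a simple curriculum. The sketch below works on plain dictionaries so it runs without downloading anything. `curriculum_order` is a hypothetical helper, the easy-to-hard ordering is one possible choice rather than the author's method, and the lowercase tier names other than `"elite"` are assumed from the sample record's style.

```python
def curriculum_order(records, tiers=("elite", "premium")):
    """Keep records in the given tiers, ordered from easiest to hardest."""
    selected = [r for r in records if r["quality_tier"] in tiers]
    return sorted(selected, key=lambda r: r["difficulty"])


# Toy records for illustration only; they are not taken from the dataset.
toy_records = [
    {"id": "a", "quality_tier": "elite", "difficulty": 0.65},
    {"id": "b", "quality_tier": "standard", "difficulty": 0.10},
    {"id": "c", "quality_tier": "premium", "difficulty": 0.30},
]

ordered = curriculum_order(toy_records)
print([r["id"] for r in ordered])
```

On a `datasets.Dataset`, the same idea can be expressed with `ds.filter(...)` followed by `ds.sort("difficulty")`.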
### Training Example
```python
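from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "your-base-model"  # placeholder: substitute a real checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:  # many causal LM tokenizers ship without a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Trainer needs token IDs. Joining "problem" and "solution" as plain text is an
# assumption; it requires both fields to be strings in every row.
def tokenize(example):
    text = example["problem"] + "\n" + example["solution"]
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = ds.map(tokenize, remove_columns=ds.column_names)
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()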
```
---
## Proprietary Version
This preview dataset is a subset of **Drift Proprietary 100K**, which includes:
- **82,507 samples** (full corpus)
- **Per-step loss weights** for curriculum learning
- **DDM evidence trajectories** with full reasoning paths
- **Multi-factor composite scores** (signal × length × self-correction × depth × verification)
- **Premium tier annotations** with loss weight multipliers
**For commercial licensing of the full dataset:**
- Contact: d.lougen@mail.utoronto.ca
- Repository: `DJLougen/ornstein-proprietary-100k` (private)
---
## Attribution
Derived from:
- [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) (NVIDIA)
- [OpenCodeReasoning](https://huggingface.co/datasets/nvidia/OpenCodeReasoning) (NVIDIA)
- [Magpie-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered) (Magpie-Align)
- [SCoRe](https://huggingface.co/datasets/jon7009/SCoRe) (Structured Chain of Reasoning)
License: CC-BY 4.0
---
## Citation
```bibtex
@dataset{drift_preview_2025,
title={Drift Preview 5K: Curated Reasoning with Evidence Accumulation},
author={Daniel Lougen},
year={2025},
url={https://huggingface.co/datasets/DJLougen/drift-preview-5k},
publisher={Hugging Face},
license={CC-BY-4.0}
}
```
---
## Version History
- **v1.0** (2025-04-20): Initial public preview release
- 5,000 curated samples
- Stratified sampling across quality tiers
- Evidence accumulation methodology preview
---
*This dataset represents a preview of proprietary curation methodology. The full quality scoring framework and DDM analysis are available in the commercial version.*
Provider:
DJLougen



