
DJLougen/drift-preview-5k

Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/DJLougen/drift-preview-5k
Resource description:
---
license: cc-by-4.0
tags:
- reasoning
- chain-of-thought
- curriculum-learning
- preview
- drift-diffusion
language:
- en
size_categories:
- 1K<n<10K
---

# Drift Preview 5K

## Overview

**Drift Preview 5K** is a public preview dataset of 5,000 high-quality reasoning samples, curated using evidence accumulation analysis and multi-factor quality scoring.

This dataset is released under CC-BY-4.0 for research and educational purposes. For the full proprietary dataset with complete annotations (per-step loss weights, DDM trajectories, premium composite scores), please contact the author for commercial licensing.

### Dataset at a Glance

| Property | Value |
|----------|-------|
| **Total Samples** | 5,000 |
| **Format** | JSON Lines |
| **License** | CC-BY 4.0 |
| **Mean Signal Score** | 79.5 |
| **Mean Difficulty** | 0.487 |

---

## Quality Tier Distribution

| Tier | Count | Percentage |
|------|-------|------------|
| Elite | 1,805 | 36.1% |
| Premium | 357 | 7.1% |
| Professional | 1,214 | 24.3% |
| Standard | 1,624 | 32.5% |

---

## Curriculum Bin Distribution

| Bin | Count | Description |
|-----|-------|-------------|
| High Quality | 152 | Strong evidence trajectory |
| Usable | 4,565 | Solid reasoning samples |
| Borderline | 283 | Near-boundary quality |

---

## Source Distribution

| Source | Count | Domain |
|--------|-------|--------|
| OpenMathInstruct2 | 2,392 | Mathematics |
| OpenCode | 1,970 | Programming |
| MagPie Pro | 612 | Conversational |
| SCoRe | 26 | Cognitive Science |

---

## Sample Structure

```json
{
  "id": "sample_001",
  "conversations": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ],
  "problem": "...",
  "solution": "...",
  "difficulty": 0.65,
  "signal_score": 72.5,
  "source": "openmathinstruct2",
  "quality_tier": "elite",
  "curriculum_bin": "high_quality",
  "has_reasoning_trajectory": true
}
```

---

## Usage

### Loading the Dataset

```python
from datasets import load_dataset

# Load the preview dataset (repository ID taken from this dataset's URL)
ds = load_dataset("DJLougen/drift-preview-5k", split="train")
```

### Training Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("your-base-model")
if tokenizer.pad_token is None:  # some causal LM tokenizers ship without a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("your-base-model")

# Trainer needs token IDs, so join each problem with its solution and tokenize
def tokenize(example):
    text = example["problem"] + "\n\n" + example["solution"]
    return tokenizer(text, truncation=True, max_length=2048)

tokenized_ds = ds.map(tokenize, remove_columns=ds.column_names)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
```

---

## Proprietary Version

This preview dataset is a subset of **Drift Proprietary 100K**, which includes:

- **82,507 samples** (full corpus)
- **Per-step loss weights** for curriculum learning
- **DDM evidence trajectories** with full reasoning paths
- **Multi-factor composite scores** (signal × length × self-correction × depth × verification)
- **Premium tier annotations** with loss weight multipliers

**For commercial licensing of the full dataset:**

- Contact: d.lougen@mail.utoronto.ca
- Repository: `DJLougen/ornstein-proprietary-100k` (private)

---

## Attribution

Derived from:

- [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) (NVIDIA)
- [OpenCodeReasoning](https://huggingface.co/datasets/nvidia/OpenCodeReasoning) (NVIDIA)
- [Magpie-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered) (Magpie-Align)
- [SCoRe](https://huggingface.co/datasets/jon7009/SCoRe) (Structured Chain of Reasoning)

License: CC-BY 4.0

---

## Citation

```bibtex
@dataset{drift_preview_2025,
  title={Drift Preview 5K: Curated Reasoning with Evidence Accumulation},
  author={Daniel Lougen},
  year={2025},
  url={https://huggingface.co/datasets/DJLougen/drift-preview-5k},
  publisher={Hugging Face},
  license={CC-BY-4.0}
}
```

---

## Version History

- **v1.0** (2025-04-20): Initial public preview release
  - 5,000 curated samples
  - Stratified sampling across quality tiers
  - Evidence accumulation methodology preview

---

*This dataset represents a preview of proprietary curation methodology. The full quality scoring framework and DDM analysis are available in the commercial version.*
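
The `quality_tier` and `difficulty` fields shown in the sample structure lend themselves to simple curriculum-style selection. The following is a minimal sketch, not the author's curation method: the helper name `select_and_order` and the toy records are hypothetical, and the lowercase tier strings other than `"elite"` are an assumption based on the single sample record.

```python
def select_and_order(records, tiers=("elite", "premium"), max_difficulty=1.0):
    """Keep records in the requested tiers, then sort easiest-first.

    `records` is any iterable of dicts carrying the `quality_tier` and
    `difficulty` keys from the sample structure.
    """
    wanted = set(tiers)
    kept = [
        r for r in records
        if r["quality_tier"] in wanted and r["difficulty"] <= max_difficulty
    ]
    # sorted() is stable, so records with equal difficulty keep their input order
    return sorted(kept, key=lambda r: r["difficulty"])


# Hand-written toy records for illustration only; not taken from the dataset.
toy = [
    {"id": "a", "quality_tier": "elite", "difficulty": 0.65},
    {"id": "b", "quality_tier": "standard", "difficulty": 0.20},
    {"id": "c", "quality_tier": "premium", "difficulty": 0.30},
]

ordered = select_and_order(toy)
print([r["id"] for r in ordered])  # ['c', 'a']
```

Because iterating a loaded `datasets` split yields plain dicts, the same helper should work on `list(ds)`; for larger corpora, the library's own `filter` and `sort` methods are likely the better fit.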
Provider: DJLougen