shreyvish5678/qwen3_5-a2d-stage1-sft
收藏Hugging Face2026-03-08 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/shreyvish5678/qwen3_5-a2d-stage1-sft
下载链接
链接失效反馈官方服务:
资源简介:
# Qwen3.5 A2D Stage 1 SFT
Curated general supervised fine-tuning corpus for Qwen3.5 text-only A2D BD3LM experiments.
## Files
- `train-*.parquet`: canonical training split data (54 shard(s)).
- `stage1_sft_metadata.json`: curation counts and source-level metadata.
- `metadata.json`: duplicate of the stage metadata for quick inspection.
## Loading
```python
from datasets import load_dataset
dataset = load_dataset("parquet", data_files={"train": "train-*.parquet"})
```
## Curation Details
# Stage 1 SFT Dataset
Canonical artifact: `/Users/shreyvishen/Projects/local_inference/auto-to-diff/datasets/stage1_sft.parquet`
Training export: `/Users/shreyvishen/Projects/local_inference/auto-to-diff/datasets/stage1_sft_train.jsonl`
Metadata: `/Users/shreyvishen/Projects/local_inference/auto-to-diff/datasets/stage1_sft_metadata.json`
## Stage 1 Filter Logic
- Tulu-3 examples are scored for reasoning, coding, math, multi-turn structure, source affinity, and response length.
- SmolTalk examples are scored for category balance, difficulty, quality, reward_model_score, conversation_tokens, and response length.
- OpenThoughts examples are included in full after normalization into the shared messages schema.
- Deduplication uses an exact normalized conversation fingerprint across all merged sources.
- Parquet is the canonical artifact; JSONL is a derived training-ready export.
## Schema
- `messages`: list of `{role, content, reasoning_content}` structs compatible with Qwen chat templating.
- `source_dataset`, `source_split`, `source_id`, `source_name`: provenance fields.
- `selection_bucket`, `category`, `difficulty`, `quality`, `reward_model_score`, `conversation_tokens`: curation metadata.
- `num_turns`, `assistant_chars`, `reasoning_chars`, `selection_score`, `dedup_fingerprint`: filtering and audit metadata.
## Sources
- allenai/tulu-3-sft-mixture
- HuggingFaceTB/smoltalk (config: smol-magpie-ultra)
- open-thoughts/OpenThoughts-114k
提供机构:
shreyvish5678



