five

shreyvish5678/qwen3_5-a2d-stage1-sft

收藏
Hugging Face2026-03-08 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/shreyvish5678/qwen3_5-a2d-stage1-sft
下载链接
链接失效反馈
官方服务:
资源简介:
# Qwen3.5 A2D Stage 1 SFT Curated general supervised fine-tuning corpus for Qwen3.5 text-only A2D BD3LM experiments. ## Files - `train-*.parquet`: canonical training split data (54 shard(s)). - `stage1_sft_metadata.json`: curation counts and source-level metadata. - `metadata.json`: duplicate of the stage metadata for quick inspection. ## Loading ```python from datasets import load_dataset dataset = load_dataset("parquet", data_files={"train": "train-*.parquet"}) ``` ## Curation Details # Stage 1 SFT Dataset Canonical artifact: `/Users/shreyvishen/Projects/local_inference/auto-to-diff/datasets/stage1_sft.parquet` Training export: `/Users/shreyvishen/Projects/local_inference/auto-to-diff/datasets/stage1_sft_train.jsonl` Metadata: `/Users/shreyvishen/Projects/local_inference/auto-to-diff/datasets/stage1_sft_metadata.json` ## Stage 1 Filter Logic - Tulu-3 examples are scored for reasoning, coding, math, multi-turn structure, source affinity, and response length. - SmolTalk examples are scored for category balance, difficulty, quality, reward_model_score, conversation_tokens, and response length. - OpenThoughts examples are included in full after normalization into the shared messages schema. - Deduplication uses an exact normalized conversation fingerprint across all merged sources. - Parquet is the canonical artifact; JSONL is a derived training-ready export. ## Schema - `messages`: list of `{role, content, reasoning_content}` structs compatible with Qwen chat templating. - `source_dataset`, `source_split`, `source_id`, `source_name`: provenance fields. - `selection_bucket`, `category`, `difficulty`, `quality`, `reward_model_score`, `conversation_tokens`: curation metadata. - `num_turns`, `assistant_chars`, `reasoning_chars`, `selection_score`, `dedup_fingerprint`: filtering and audit metadata. ## Sources - allenai/tulu-3-sft-mixture - HuggingFaceTB/smoltalk (config: smol-magpie-ultra) - open-thoughts/OpenThoughts-114k
提供机构:
shreyvish5678
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作