
tensorfiend/SimpleThoughts

Hugging Face · Updated 2026-04-05 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/tensorfiend/SimpleThoughts
Resource description:
---
language:
- en
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
tags:
- thought-experiments
- llm-training
- pretraining
- instruction-tuning
- sft
- alignment
- dpo
- reasoning
- chain-of-thought
- synthetic
- science
- physics
- biology
- logic
- full-pipeline
pretty_name: SimpleThoughts
size_categories:
- 100K<n<1M
dataset_info:
- config_name: alignment
  features:
  - name: prompt
    dtype: string
  - name: chosen
    dtype: string
  - name: rejected
    dtype: string
  - name: topic
    dtype: string
  - name: subtopic
    dtype: string
  - name: error_type
    dtype: string
  - name: judge_score
    dtype: float64
  - name: judge_reasoning
    dtype: string
  - name: model_chosen
    dtype: string
  - name: model_rejected
    dtype: string
  - name: timestamp
    dtype: string
  splits:
  - name: train
    num_bytes: 12546003
    num_examples: 7172
  download_size: 6200282
  dataset_size: 12546003
- config_name: pretrain
  features:
  - name: text
    dtype: string
  - name: topic
    dtype: string
  - name: subtopic
    dtype: string
  - name: concept
    dtype: string
  - name: model
    dtype: string
  - name: provider
    dtype: string
  - name: token_count
    dtype: int64
  - name: timestamp
    dtype: string
  splits:
  - name: train
    num_bytes: 1393103131
    num_examples: 352214
  download_size: 759099376
  dataset_size: 1393103131
- config_name: reasoning
  features:
  - name: input
    dtype: string
  - name: thought_trace
    dtype: string
  - name: output
    dtype: string
  - name: topic
    dtype: string
  - name: subtopic
    dtype: string
  - name: has_misconception
    dtype: bool
  - name: model
    dtype: string
  - name: provider
    dtype: string
  - name: timestamp
    dtype: string
  - name: thought_trace_compressed
    dtype: string
  - name: output_compressed
    dtype: string
  - name: compression_model
    dtype: string
  splits:
  - name: train
    num_bytes: 56320077
    num_examples: 6300
  download_size: 29428684
  dataset_size: 56320077
- config_name: sft
  features:
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: topic
    dtype: string
  - name: subtopic
    dtype: string
  - name: query_type
    dtype: string
  - name: model
    dtype: string
  - name: provider
    dtype: string
  - name: timestamp
    dtype: string
  splits:
  - name: train
    num_bytes: 25833050
    num_examples: 25788
  download_size: 23317582
  dataset_size: 25833050
configs:
- config_name: alignment
  data_files:
  - split: train
    path: alignment/train-*
- config_name: pretrain
  data_files:
  - split: train
    path: pretrain/train-*
- config_name: reasoning
  data_files:
  - split: train
    path: reasoning/train-*
- config_name: sft
  data_files:
  - split: train
    path: sft/train-*
---

# SimpleThoughts

**A complete synthetic training corpus, built entirely around simple thought experiments, that spans all four LLM training stages: pretraining, supervised fine-tuning (SFT), preference alignment (DPO), and reasoning.**

SimpleThoughts is designed to train language models that can *think clearly about everyday phenomena* rather than just recall facts. Every sample is grounded in a concrete thought experiment: intuitive physics, causal inference, biology, economics, spatial reasoning, and more.

## Why SimpleThoughts?

Most open training datasets are either stage-specific (pretrain-only or chat-only) or domain-narrow. SimpleThoughts is different:

- **Full pipeline coverage** - one dataset repo for all four training stages
- **Conceptually coherent** - all data is grounded in the same thought-experiment taxonomy, so models trained end-to-end develop a consistent reasoning style
- **Richly annotated** - every sample carries topic, subtopic, model provenance, and stage-specific metadata
- **High quality** - generated by frontier models (DeepSeek V3, Qwen3 32B, Llama 3.3 70B, Mistral Small 3.2), with judge scoring on alignment pairs
- **Experimental** - a relatively small dataset, intended to help early-stage researchers learn the end-to-end LLM training pipeline
## Dataset Summary

| Config | Stage | Samples | Format |
|---|---|---|---|
| `pretrain` | Pretraining | 352,214 | Free-form text |
| `sft` | Supervised Fine-Tuning | 25,788 | Chat (messages list) |
| `alignment` | Preference Alignment (DPO) | 7,172 | Chosen / rejected pairs |
| `reasoning` | Reasoning | 6,300 | Input + think trace + output |
| **Total** | | **391,474** | |

## Quick Start

```python
from datasets import load_dataset

# Load a specific training stage
pretrain_ds = load_dataset("tensorfiend/SimpleThoughts", "pretrain")
sft_ds = load_dataset("tensorfiend/SimpleThoughts", "sft")
alignment_ds = load_dataset("tensorfiend/SimpleThoughts", "alignment")
reasoning_ds = load_dataset("tensorfiend/SimpleThoughts", "reasoning")
```

## Configs

### Pretraining (352,214 samples)

Free-form expository text on 44 STEM and conceptual topics, suitable for causal language model pretraining.

```json
{
  "text": "Surface tension arises because water molecules at the surface ...",
  "topic": "intuitive_physics",
  "subtopic": "surface_forces",
  "concept": "surface tension",
  "model": "deepseek-chat",
  "provider": "deepseek",
  "token_count": 312,
  "timestamp": "2025-01-14T10:22:31"
}
```

Topics covered (44): intuitive_physics, biology_life, chemistry_matter, energy_thermodynamics, economics_game_theory, artificial_intelligence, geometry_space, human_body, earth_science, electricity_magnetism, and 34 more.

Generation mix: DeepSeek V3 (70%) + Llama-3.3-70B (30%)

### SFT (25,788 samples)

Instruction-following data in multi-turn chat format (ChatML-style messages list). Three question types (counterfactual, explanatory, and predictive) across 18 conceptual domains.
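As a rough illustration of how a `messages` list from the `sft` config might be flattened into training text, the sketch below renders role/content pairs with ChatML delimiters. The `<|im_start|>` / `<|im_end|>` markers are an assumption (the card only says "ChatML-style"), and `render_chatml` and the sample record are hypothetical, not part of the dataset or its tooling. In practice a tokenizer's own chat template would likely replace this.

```python
from typing import Dict, List

# Assumed ChatML delimiters; the dataset card does not specify them.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]]) -> str:
    """Join role/content messages into a single ChatML-formatted string."""
    parts = []
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    return "".join(parts)


# Hypothetical record shaped like one row of the `sft` config.
example = {
    "messages": [
        {"role": "user", "content": "Why does ice float on water?"},
        {"role": "assistant", "content": "Ice is less dense than liquid water."},
    ],
    "query_type": "explanatory",
}

chatml_text = render_chatml(example["messages"])
print(chatml_text)
```

With the real dataset, the same function could be applied per row, for instance through `sft_ds["train"].map(...)`.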
```json
{
  "messages": [
    {"role": "user", "content": "If you removed all the air from a sealed room, what would happen to a lit candle?"},
    {"role": "assistant", "content": "The candle would go out almost immediately ..."}
  ],
  "topic": "intuitive_physics",
  "subtopic": "combustion",
  "query_type": "counterfactual",
  "model": "qwen3-32b",
  "provider": "hyperbolic",
  "timestamp": "2025-02-01T08:14:05"
}
```

Query types: counterfactual · explanatory · predictive

Topics covered (18): intuitive_physics, logic_causal_inference, theory_of_mind, spatial_reasoning, biology_life, chemistry_matter, economics_game_theory, human_body, and more.

Generation mix: Qwen3 32B (50%) + Mistral Small 3.2 24B (50%)

### Alignment (7,172 samples)

Preference pairs (chosen / rejected) for DPO / RLHF training. Rejected responses contain one of three annotated error types. Each pair is judge-scored by a frontier model with a reasoning trace.

```json
{
  "prompt": "Why does correlation not imply causation?",
  "chosen": "Correlation means two variables move together, but ...",
  "rejected": "If two things always happen together, one must cause the other ...",
  "topic": "logic_causal_inference",
  "subtopic": "statistical_reasoning",
  "error_type": "correlation_causation",
  "judge_score": 4,
  "judge_reasoning": "The chosen response correctly distinguishes ...",
  "model_chosen": "qwen3-32b",
  "model_rejected": "qwen3-32b",
  "timestamp": "2025-02-20T15:43:11"
}
```

Error types in rejected responses:

- `correlation_causation`: confuses correlation with causal relationships
- `teleological`: attributes purpose or intent to natural processes
- `imprecise_metaphor`: uses analogies that subtly mislead

### Reasoning (6,300 samples)

Step-by-step chain-of-thought data with explicit `<think>...</think>` traces, compressed variants, and final answers. Ideal for reasoning fine-tuning and distillation.

```json
{
  "input": "A bat and a ball together cost $1.10. The bat costs $1 more than the ball. How much does the ball cost?",
  "thought_trace": "<think>\nLet the ball cost x cents...\n</think>",
  "output": "The ball costs $0.05.",
  "topic": "logic_causal_inference",
  "subtopic": "algebraic_reasoning",
  "has_misconception": true,
  "model": "qwen3-32b",
  "provider": "hyperbolic",
  "thought_trace_compressed": "Thought trace compressed version for smaller models",
  "output_compressed": "Output compressed for smaller models",
  "compression_model": "qwen3-32b",
  "timestamp": "2025-03-10T09:05:22"
}
```

Topics covered (8): logic_causal_inference, intuitive_physics, spatial_reasoning, theory_of_mind, economics_game_theory, biological_logic, material_science, systems_theory

## Intended Uses

- Full-pipeline LLM training: train a model from scratch through all four stages using a single coherent dataset
- Stage-specific fine-tuning: use any individual config independently
- Reasoning research: the `reasoning` config, with its compressed thought traces, is useful for studying chain-of-thought distillation
- Alignment research: the `alignment` config has typed error categories useful for studying failure modes in preference learning
- Benchmarking: thought-experiment questions are a natural test of intuitive reasoning beyond surface-level recall

## Related

- Model trained on this dataset: https://huggingface.co/tensorfiend/DotLM (coming soon)
- Training framework: https://github.com/shanmukh05/DotLM (coming soon)

## Citation

    @dataset{simplethoughts2026,
      author = {Shanmukh},
      title = {SimpleThoughts: A Full-Pipeline Thought Experiment Training Corpus},
      year = {2026},
      publisher = {Hugging Face},
      url = {https://huggingface.co/datasets/tensorfiend/SimpleThoughts}
    }

## License

https://creativecommons.org/licenses/by/4.0/
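Returning to the `reasoning` config: one plausible way to turn a row into a single training target is to normalize the `<think>...</think>` trace and append the final answer, optionally using the compressed fields. This is only a sketch under stated assumptions. `split_trace` and `build_target` are hypothetical helpers, the sample record is invented for illustration, and whether compressed traces carry `<think>` tags is not stated in the card, so the code tolerates both cases.

```python
import re
from typing import Tuple

THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_trace(thought_trace: str) -> Tuple[str, bool]:
    """Return (reasoning text, whether <think> tags were present)."""
    match = THINK_PATTERN.search(thought_trace)
    if match is None:
        # Assumption: some traces (e.g. compressed ones) may lack the tags.
        return thought_trace.strip(), False
    return match.group(1).strip(), True


def build_target(record: dict, use_compressed: bool = False) -> str:
    """Concatenate a normalized think block with the final answer."""
    trace_key = "thought_trace_compressed" if use_compressed else "thought_trace"
    output_key = "output_compressed" if use_compressed else "output"
    reasoning, _ = split_trace(record[trace_key])
    return f"<think>\n{reasoning}\n</think>\n{record[output_key]}"


# Hypothetical record shaped like one row of the `reasoning` config.
record = {
    "input": "A bat and a ball together cost $1.10. The bat costs $1 more "
             "than the ball. How much does the ball cost?",
    "thought_trace": "<think>\nLet the ball cost x. Then x + (x + 1.00) = 1.10, "
                     "so x = 0.05.\n</think>",
    "output": "The ball costs $0.05.",
    "thought_trace_compressed": "x + (x + 1.00) = 1.10, so x = 0.05.",
    "output_compressed": "$0.05",
}

full_target = build_target(record)
small_target = build_target(record, use_compressed=True)
print(full_target)
print(small_target)
```

Keeping the trace extraction separate from target construction makes it easy to swap in a different output layout for distillation experiments.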
Provider:
tensorfiend