five

nlile/TeichAI-curated-sft-39k

收藏
Hugging Face2025-12-11 更新2025-12-20 收录
下载链接:
https://hf-mirror.com/datasets/nlile/TeichAI-curated-sft-39k
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: messages list: - name: role dtype: string - name: content dtype: string - name: user_prompt dtype: string - name: assistant_response dtype: string - name: source_table dtype: string - name: model_name dtype: string - name: prompt_length dtype: int64 - name: response_length dtype: int64 splits: - name: train num_bytes: 1177205486 num_examples: 39463 download_size: 580059070 dataset_size: 1177205486 configs: - config_name: default data_files: - split: train path: data/train-* --- # TeichAI Unified Merge (Curated) A curated, deduplicated, and quality-validated merge of 21 TeichAI supervised fine-tuning datasets featuring responses from 18 frontier reasoning models. ## Attribution **Original datasets by [TeichAI](https://huggingface.co/TeichAI)** - This is a cleaned and unified version of their individual model response datasets. All credit for data collection and generation goes to TeichAI. **Curation by [nlile](https://huggingface.co/nlile)** - Quality validation, deduplication, schema unification, and extended analysis. ## Dataset Summary | Metric | Value | |--------|-------| | Total rows | 39,463 | | Unique prompts | 20,443 | | Unique models | 18 | | Source tables | 21 | | Avg response length | 14,452 chars | | Responses with `<think>` tags | 63.6% | ## What Changed From Original ### Filtering Applied - **508 empty responses removed** - Validated as generation failures, not safety refusals - **252 duplicate rollouts removed** - Exact (prompt, response, model) matches - 248 from `gemini-3-pro-preview` (250x was exact subset of 1000x) - 4 internal duplicates from `gpt-5-codex-250x` ### Schema Enhancements Original schema: `messages STRUCT[]` only New columns added: | Column | Type | Description | |--------|------|-------------| | `user_prompt` | VARCHAR | Extracted from `messages[2].content` | | `assistant_response` | VARCHAR | Extracted from `messages[3].content` | | `source_table` | VARCHAR | Original TeichAI dataset name | | `model_name` | VARCHAR | Inferred generating model | | `prompt_length` | INTEGER | Character count of prompt | | `response_length` | INTEGER | Character count of response | ## Models Included | Model | Rows | Avg Response | Code% | Think Tags | |-------|------|--------------|-------|------------| | gemini-2.5-flash | 11,086 | 16.9K | 17% | 99% | | sherlock-thinking-alpha | 11,080 | 5.2K | 36% | 0% | | deepseek-v3.2-openr1-math | 3,317 | 20.7K | 1% | 100% | | deepseek-v3.2-OpenCodeReasoning | 2,562 | 63.5K | 75% | 100% | | gpt-5-codex | 1,236 | 5.8K | 38% | 100% | | kimi-k2-thinking | 1,181 | 15.3K | 49% | 100% | | gemini-3-pro-preview | 1,018 | 7.3K | 47% | 100% | | grok-code-fast-1 | 1,017 | 4.7K | 47% | 100% | | sherlock-think-alpha | 1,017 | 4.4K | 49% | 0% | | gpt-5.1 | 1,017 | 7.5K | 46% | 100% | | polaris-alpha | 1,017 | 4.2K | 26% | 0% | | sherlock-dash-alpha | 1,014 | 5.7K | 53% | 0% | | gemini-2.5-flash-lite | 991 | 8.1K | 44% | 100% | | deepseek-v3.2-speciale | 976 | 14.8K | 45% | 100% | | claude-4.5-opus | 250 | 30.1K | 49% | 100% | | claude-sonnet-4.5 | 247 | 5.1K | 35% | 100% | | glm-4.6 | 245 | 11.6K | 39% | 100% | | grok-4-fast | 192 | 3.9K | 1% | 0% | ## Domain Distribution | Domain | Count | % | |--------|-------|---| | General/Other | 25,159 | 63.8% | | Coding | 6,291 | 15.9% | | Math/Reasoning | 5,800 | 14.7% | | Creative | 1,218 | 3.1% | | Philosophy | 995 | 2.5% | ## Programming Languages in Code Blocks | Language | Count | % of Dataset | |----------|-------|--------------| | Python | 5,243 | 13.3% | | JavaScript | 1,734 | 4.4% | | HTML | 684 | 1.7% | | C++ | 654 | 1.7% | | Bash/Shell | 370 | 0.9% | | TypeScript | 287 | 0.7% | | SQL | 123 | 0.3% | | Rust | 54 | 0.1% | | Go | 50 | 0.1% | ## Quality Validation Performed | Check | Result | |-------|--------| | PII (emails, phones, SSNs) | ✅ Zero detected | | Duplicate rollouts | ✅ Removed 252 | | Empty responses | ✅ Removed 508 | | Near-duplicates | ✅ Only valid math answers | | Boilerplate phrases | ✅ Max 0.55% repetition | | Unicode/encoding issues | ✅ Clean (0.01% CR, 0.23% HTML entities) | | Null bytes | ✅ Zero | | Unclosed code blocks | ⚠️ 118 rows (0.3%) | ## Prompt Overlap Analysis Many prompts were intentionally sent to multiple models for comparison: | Prompts in N Models | Count | % | |--------------------|-------|---| | 1 model only | 12,156 | 59.5% | | 2 models | 7,128 | 34.9% | | 3-5 models | 66 | 0.3% | | 6-8 models | 210 | 1.0% | | 10-15 models | 883 | 4.3% | ## Response Length Distribution | Bucket | Count | % | |--------|-------|---| | < 100 chars | 146 | 0.4% | | 100-500 | 815 | 2.1% | | 500-1K | 990 | 2.5% | | 1K-5K | 9,481 | 24.0% | | 5K-10K | 11,592 | 29.4% | | 10K-50K | 14,777 | 37.4% | | 50K-100K | 1,094 | 2.8% | | > 100K | 568 | 1.4% | ## Source Datasets All 21 original TeichAI datasets merged: ``` TeichAI_brainstorm-v3.1-grok-4-fast-200x TeichAI_claude-4.5-opus-250x TeichAI_claude-sonnet-4.5-250x TeichAI_deepseek-v3.2-openr1-math-3200x TeichAI_deepseek-v3.2-speciale-1000x TeichAI_deepseek-v3.2-speciale-OpenCodeReasoning-3k TeichAI_gemini-2.5-flash-11000x TeichAI_gemini-2.5-flash-lite-1000x TeichAI_gemini-3-pro-preview-high-reasoning-1000x TeichAI_gemini-3-pro-preview-high-reasoning-250x TeichAI_glm-4.6-250x TeichAI_gpt-5-codex-1000x TeichAI_gpt-5-codex-250x TeichAI_gpt-5.1-1000x TeichAI_grok-code-fast-1-1000x TeichAI_kimi-k2-thinking-1000x TeichAI_kimi-k2-thinking-250x TeichAI_polaris-alpha-1000x TeichAI_sherlock-dash-alpha-1000x TeichAI_sherlock-think-alpha-1000x TeichAI_sherlock-thinking-alpha-11000x ``` ## Notes - **Thinking tags**: ~64% of responses contain `<think>...</think>` reasoning blocks. This is intentional for reasoning-focused models. - **Emoji endings**: `sherlock-thinking-alpha` responses often end with emojis (🕵️‍♂️, 🚀, 🔍) - stylistic choice, not truncation. - **Empty system messages**: All rows have empty system messages in the original `messages` struct (consistent across all sources).

数据集信息: 特征: - 名称:messages 列表: - 名称:role 数据类型:字符串 - 名称:content 数据类型:字符串 - 名称:user_prompt 数据类型:字符串 - 名称:assistant_response 数据类型:字符串 - 名称:source_table 数据类型:字符串 - 名称:model_name 数据类型:字符串 - 名称:prompt_length 数据类型:64位整数 - 名称:response_length 数据类型:64位整数 划分: - 名称:train 字节数:1177205486 样本数:39463 下载大小:580059070 数据集总大小:1177205486 配置: - 配置名称:default 数据文件: - 划分:train 路径:data/train-* # TeichAI 统一合并(精选版) 这是一个经过精选、去重(deduplication)与质量验证的合并数据集,整合了21个TeichAI监督微调(supervised fine-tuning)数据集,涵盖18个前沿推理模型的回复。 ## 归属声明 **原始数据集由 [TeichAI](https://huggingface.co/TeichAI) 制作** —— 本数据集是其各模型回复数据集的清理与统一版本。所有数据收集与生成工作的版权均归TeichAI所有。 **整理者:[nlile](https://huggingface.co/nlile)** —— 负责质量验证、去重、模式统一与扩展分析。 ## 数据集概览 | 指标 | 数值 | |--------|-------| | 总样本数 | 39,463 | | 唯一提示词数 | 20,443 | | 参与模型数 | 18 | | 源数据集表数 | 21 | | 平均回复长度 | 14,452 字符 | | 包含`<think>`标签的回复占比 | 63.6% | ## 与原始版本的差异 ### 应用的过滤规则 - **移除508条空回复** —— 经验证为生成失败而非安全拦截导致的拒绝回复 - **移除252条重复回复样本** —— (提示词、回复、模型)完全匹配的重复项 - 248条来自`gemini-3-pro-preview`(250x样本是1000x样本的精确子集) - 4条来自`gpt-5-codex-250x`的内部重复项 ### 模式增强 原始模式:仅包含`messages STRUCT[]`结构 新增列说明: | 列名 | 数据类型 | 描述 | |--------|------|-------------| | `user_prompt` | VARCHAR | 从`messages[2].content`中提取的用户提示词 | | `assistant_response` | VARCHAR | 从`messages[3].content`中提取的助手回复 | | `source_table` | VARCHAR | 原始TeichAI数据集名称 | | `model_name` | VARCHAR | 生成该回复的模型名称 | | `prompt_length` | INTEGER | 提示词的字符数 | | `response_length` | INTEGER | 回复的字符数 | ## 纳入的模型 | 模型名称 | 样本数 | 平均回复长度 | 代码占比 | 含思考标签占比 | |-------|------|--------------|-------|------------| | gemini-2.5-flash | 11,086 | 16.9K | 17% | 99% | | sherlock-thinking-alpha | 11,080 | 5.2K | 36% | 0% | | deepseek-v3.2-openr1-math | 3,317 | 20.7K | 1% | 100% | | deepseek-v3.2-OpenCodeReasoning | 2,562 | 63.5K | 75% | 100% | | gpt-5-codex | 1,236 | 5.8K | 38% | 100% | | kimi-k2-thinking | 1,181 | 15.3K | 49% | 100% | | gemini-3-pro-preview | 1,018 | 7.3K | 47% | 100% | | grok-code-fast-1 | 1,017 | 4.7K | 47% | 100% | | sherlock-think-alpha | 1,017 | 4.4K | 49% | 0% | | gpt-5.1 | 1,017 | 7.5K | 46% | 100% | | polaris-alpha | 1,017 | 4.2K | 26% | 0% | | sherlock-dash-alpha | 1,014 | 5.7K | 53% | 0% | | gemini-2.5-flash-lite | 991 | 8.1K | 44% | 100% | | deepseek-v3.2-speciale | 976 | 14.8K | 45% | 100% | | claude-4.5-opus | 250 | 30.1K | 49% | 100% | | claude-sonnet-4.5 | 247 | 5.1K | 35% | 100% | | glm-4.6 | 245 | 11.6K | 39% | 100% | | grok-4-fast | 192 | 3.9K | 1% | 0% | ## 领域分布 | 领域 | 样本数 | 占比 | |--------|-------|---| | 通用/其他 | 25,159 | 63.8% | | 代码生成 | 6,291 | 15.9% | | 数学/推理 | 5,800 | 14.7% | | 创意创作 | 1,218 | 3.1% | | 哲学思辨 | 995 | 2.5% | ## 代码块中的编程语言分布 | 语言 | 样本数 | 占数据集总样本比例 | |----------|-------|--------------| | Python | 5,243 | 13.3% | | JavaScript | 1,734 | 4.4% | | HTML | 684 | 1.7% | | C++ | 654 | 1.7% | | Bash/Shell | 370 | 0.9% | | TypeScript | 287 | 0.7% | | SQL | 123 | 0.3% | | Rust | 54 | 0.1% | | Go | 50 | 0.1% | ## 执行的质量验证项 | 验证项 | 验证结果 | |-------|--------| | 个人可识别信息(邮箱、电话、社保号) | ✅ 未检测到 | | 重复回复样本 | ✅ 已移除252条 | | 空回复 | ✅ 已移除508条 | | 近似重复项 | ✅ 仅保留合法的数学解答 | | 模板化短语 | ✅ 重复率最高不超过0.55% | | 字符编码/Unicode问题 | ✅ 已清理(0.01%的回车符、0.23%的HTML实体) | | 空字节 | ✅ 未检测到 | | 未闭合的代码块 | ⚠️ 共118条,占比0.3% | ## 提示词重叠分析 为便于对比,大量提示词被发送至多个模型生成回复: | 单提示词覆盖模型数 | 样本数 | 占比 | |--------------------|-------|---| | 仅1个模型 | 12,156 | 59.5% | | 2个模型 | 7,128 | 34.9% | | 3-5个模型 | 66 | 0.3% | | 6-8个模型 | 210 | 1.0% | | 10-15个模型 | 883 | 4.3% | ## 回复长度分布 | 长度区间 | 样本数 | 占比 | |--------|-------|---| | < 100 字符 | 146 | 0.4% | | 100-500 字符 | 815 | 2.1% | | 500-1000 字符 | 990 | 2.5% | | 1000-5000 字符 | 9,481 | 24.0% | | 5000-10000 字符 | 11,592 | 29.4% | | 10000-50000 字符 | 14,777 | 37.4% | | 50000-100000 字符 | 1,094 | 2.8% | | > 100000 字符 | 568 | 1.4% | ## 源数据集 本次合并共纳入21个原始TeichAI数据集: TeichAI_brainstorm-v3.1-grok-4-fast-200x TeichAI_claude-4.5-opus-250x TeichAI_claude-sonnet-4.5-250x TeichAI_deepseek-v3.2-openr1-math-3200x TeichAI_deepseek-v3.2-speciale-1000x TeichAI_deepseek-v3.2-speciale-OpenCodeReasoning-3k TeichAI_gemini-2.5-flash-11000x TeichAI_gemini-2.5-flash-lite-1000x TeichAI_gemini-3-pro-preview-high-reasoning-1000x TeichAI_gemini-3-pro-preview-high-reasoning-250x TeichAI_glm-4.6-250x TeichAI_gpt-5-codex-1000x TeichAI_gpt-5-codex-250x TeichAI_gpt-5.1-1000x TeichAI_grok-code-fast-1-1000x TeichAI_kimi-k2-thinking-1000x TeichAI_kimi-k2-thinking-250x TeichAI_polaris-alpha-1000x TeichAI_sherlock-dash-alpha-1000x TeichAI_sherlock-think-alpha-1000x TeichAI_sherlock-thinking-alpha-11000x ## 补充说明 - **思考标签**:约64%的回复包含`<think>...</think>`推理块,这是针对推理类模型的设计。 - **表情结尾**:`sherlock-thinking-alpha`的回复常以表情(🕵️‍♂️、🚀、🔍)结尾,属于风格选择,并非截断导致。 - **空系统消息**:所有样本的原始`messages`结构中均包含空系统消息,各源数据集保持一致。
提供机构:
nlile
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作