nlile/TeichAI-curated-sft-39k
Source: Hugging Face. Updated 2025-12-11; indexed 2025-12-20.
Download link: https://hf-mirror.com/datasets/nlile/TeichAI-curated-sft-39k
---
dataset_info:
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: user_prompt
    dtype: string
  - name: assistant_response
    dtype: string
  - name: source_table
    dtype: string
  - name: model_name
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: response_length
    dtype: int64
  splits:
  - name: train
    num_bytes: 1177205486
    num_examples: 39463
  download_size: 580059070
  dataset_size: 1177205486
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# TeichAI Unified Merge (Curated)
A curated, deduplicated, and quality-validated merge of 21 TeichAI supervised fine-tuning datasets featuring responses from 18 frontier reasoning models.
## Attribution
**Original datasets by [TeichAI](https://huggingface.co/TeichAI)** - This is a cleaned and unified version of their individual model response datasets. All credit for data collection and generation goes to TeichAI.
**Curation by [nlile](https://huggingface.co/nlile)** - Quality validation, deduplication, schema unification, and extended analysis.
## Dataset Summary
| Metric | Value |
|--------|-------|
| Total rows | 39,463 |
| Unique prompts | 20,443 |
| Unique models | 18 |
| Source tables | 21 |
| Avg response length | 14,452 chars |
| Responses with `<think>` tags | 63.6% |
## What Changed From Original
### Filtering Applied
- **508 empty responses removed** - Validated as generation failures, not safety refusals
- **252 duplicate rollouts removed** - Exact (prompt, response, model) matches
  - 248 from `gemini-3-pro-preview` (the 250x set was an exact subset of the 1000x set)
  - 4 internal duplicates from `gpt-5-codex-250x`
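
The two filters above can be sketched in plain Python. This is a minimal illustration rather than the curator's actual pipeline: the function name `filter_rows` is hypothetical, and treating whitespace-only responses as empty is an assumption.

```python
def filter_rows(rows):
    """Drop empty responses, then exact (prompt, response, model) duplicates.

    `rows` is a list of dicts using the column names from this card.
    Assumption: a whitespace-only response counts as empty.
    """
    seen = set()
    kept = []
    for row in rows:
        response = row["assistant_response"]
        if not response or not response.strip():
            continue  # empty response, treated as a generation failure
        key = (row["user_prompt"], response, row["model_name"])
        if key in seen:
            continue  # exact duplicate rollout
        seen.add(key)
        kept.append(row)
    return kept
```

Keeping the first occurrence preserves the original row order. The same prompt and response from a different model is not treated as a duplicate, because the model name is part of the key.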
### Schema Enhancements
Original schema: `messages STRUCT[]` only
New columns added:
| Column | Type | Description |
|--------|------|-------------|
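| `user_prompt` | VARCHAR | Extracted from `messages[2].content`, the user turn (the index appears to be 1-based, with the empty system message first) |
| `assistant_response` | VARCHAR | Extracted from `messages[3].content`, the assistant turn (same 1-based indexing) |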
| `source_table` | VARCHAR | Original TeichAI dataset name |
| `model_name` | VARCHAR | Inferred generating model |
| `prompt_length` | INTEGER | Character count of prompt |
| `response_length` | INTEGER | Character count of response |
## Models Included
| Model | Rows | Avg Response | Code% | Think Tags |
|-------|------|--------------|-------|------------|
| gemini-2.5-flash | 11,086 | 16.9K | 17% | 99% |
| sherlock-thinking-alpha | 11,080 | 5.2K | 36% | 0% |
| deepseek-v3.2-openr1-math | 3,317 | 20.7K | 1% | 100% |
| deepseek-v3.2-OpenCodeReasoning | 2,562 | 63.5K | 75% | 100% |
| gpt-5-codex | 1,236 | 5.8K | 38% | 100% |
| kimi-k2-thinking | 1,181 | 15.3K | 49% | 100% |
| gemini-3-pro-preview | 1,018 | 7.3K | 47% | 100% |
| grok-code-fast-1 | 1,017 | 4.7K | 47% | 100% |
| sherlock-think-alpha | 1,017 | 4.4K | 49% | 0% |
| gpt-5.1 | 1,017 | 7.5K | 46% | 100% |
| polaris-alpha | 1,017 | 4.2K | 26% | 0% |
| sherlock-dash-alpha | 1,014 | 5.7K | 53% | 0% |
| gemini-2.5-flash-lite | 991 | 8.1K | 44% | 100% |
| deepseek-v3.2-speciale | 976 | 14.8K | 45% | 100% |
| claude-4.5-opus | 250 | 30.1K | 49% | 100% |
| claude-sonnet-4.5 | 247 | 5.1K | 35% | 100% |
| glm-4.6 | 245 | 11.6K | 39% | 100% |
| grok-4-fast | 192 | 3.9K | 1% | 0% |
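
The row count, average response length, and think-tag share in the table above could be recomputed along these lines. This is a sketch under assumptions: `per_model_stats` is a hypothetical name, and "has think tags" is approximated as the response containing the literal string `<think>`.

```python
from collections import defaultdict


def per_model_stats(rows):
    """Group rows by `model_name` and compute simple per-model statistics."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["model_name"]].append(row["assistant_response"])

    stats = {}
    for model, responses in grouped.items():
        n = len(responses)
        stats[model] = {
            "rows": n,
            "avg_response_chars": sum(len(r) for r in responses) / n,
            # Share of responses containing an opening think tag (assumed rule).
            "think_share": sum("<think>" in r for r in responses) / n,
        }
    return stats
```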
## Domain Distribution
| Domain | Count | % |
|--------|-------|---|
| General/Other | 25,159 | 63.8% |
| Coding | 6,291 | 15.9% |
| Math/Reasoning | 5,800 | 14.7% |
| Creative | 1,218 | 3.1% |
| Philosophy | 995 | 2.5% |
## Programming Languages in Code Blocks
| Language | Count | % of Dataset |
|----------|-------|--------------|
| Python | 5,243 | 13.3% |
| JavaScript | 1,734 | 4.4% |
| HTML | 684 | 1.7% |
| C++ | 654 | 1.7% |
| Bash/Shell | 370 | 0.9% |
| TypeScript | 287 | 0.7% |
| SQL | 123 | 0.3% |
| Rust | 54 | 0.1% |
| Go | 50 | 0.1% |
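
One plausible way to obtain counts like these is to read the language label on each opening code fence and count every row at most once per language. The card does not say how the counts were produced, so the regular expression, the lowercase normalization, and the names `languages_in` and `count_languages` are all assumptions.

```python
import re
from collections import Counter

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this example

# An opening fence followed by a language label, alone on its line.
OPEN_RE = re.compile(
    r"^[ \t]*" + re.escape(FENCE) + r"([A-Za-z0-9_+#-]+)[ \t]*$",
    re.MULTILINE,
)


def languages_in(text):
    """Return the set of lowercase language labels used on code fences."""
    return {match.group(1).lower() for match in OPEN_RE.finditer(text)}


def count_languages(rows):
    """Count rows per language; a row counts once per language it contains."""
    counts = Counter()
    for row in rows:
        counts.update(languages_in(row["assistant_response"]))
    return counts
```

Labels are taken as written, so aliases such as `py` and `python`, or `sh` and `bash`, would need to be merged in a separate step to match the grouped names in the table.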
## Quality Validation Performed
| Check | Result |
|-------|--------|
| PII (emails, phones, SSNs) | ✅ Zero detected |
| Duplicate rollouts | ✅ Removed 252 |
| Empty responses | ✅ Removed 508 |
| Near-duplicates | ✅ Only valid math answers |
| Boilerplate phrases | ✅ Max 0.55% repetition |
| Unicode/encoding issues | ✅ Clean (0.01% CR, 0.23% HTML entities) |
| Null bytes | ✅ Zero |
| Unclosed code blocks | ⚠️ 118 rows (0.3%) |
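
The unclosed-code-block check can be approximated by counting fence lines: an odd count suggests a block that was opened but never closed. The heuristic below is a sketch and not necessarily the check used for this card. It ignores tilde fences and nested fences of different lengths, and `has_unclosed_fence` is a hypothetical name.

```python
FENCE = "`" * 3  # built programmatically to avoid a literal fence in this example


def has_unclosed_fence(text):
    """Return True if the number of backtick fence lines is odd."""
    fence_lines = sum(
        1 for line in text.splitlines() if line.lstrip().startswith(FENCE)
    )
    return fence_lines % 2 == 1
```

Rows flagged this way are worth inspecting before training, since a missing closing fence may indicate a truncated response.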
## Prompt Overlap Analysis
Many prompts were intentionally sent to multiple models for comparison:
| Prompts in N Models | Count | % |
|--------------------|-------|---|
| 1 model only | 12,156 | 59.5% |
| 2 models | 7,128 | 34.9% |
| 3-5 models | 66 | 0.3% |
| 6-8 models | 210 | 1.0% |
| 10-15 models | 883 | 4.3% |
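
The overlap table can be reproduced by collecting the set of models that answered each distinct prompt and then counting prompts by set size. The sketch below assumes prompts are matched on the exact `user_prompt` string; `models_per_prompt` is a hypothetical name.

```python
from collections import Counter, defaultdict


def models_per_prompt(rows):
    """Map 'number of distinct models' to 'number of prompts with that many'."""
    models = defaultdict(set)
    for row in rows:
        models[row["user_prompt"]].add(row["model_name"])
    return Counter(len(names) for names in models.values())
```

Grouping the resulting counts into the ranges used above (1, 2, 3-5, and so on) is then a matter of summing adjacent keys.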
## Response Length Distribution
| Bucket | Count | % |
|--------|-------|---|
| < 100 chars | 146 | 0.4% |
| 100-500 | 815 | 2.1% |
| 500-1K | 990 | 2.5% |
| 1K-5K | 9,481 | 24.0% |
| 5K-10K | 11,592 | 29.4% |
| 10K-50K | 14,777 | 37.4% |
| 50K-100K | 1,094 | 2.8% |
| > 100K | 568 | 1.4% |
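
For reference, assigning a `response_length` to one of these buckets might look like the sketch below. The card does not state which side of each boundary is inclusive, so the code assumes the lower bound is inclusive and the upper bound exclusive; `EDGES`, `LABELS`, and `length_bucket` are hypothetical names.

```python
from bisect import bisect_right

# Bucket boundaries in characters, taken from the table above.
EDGES = [100, 500, 1_000, 5_000, 10_000, 50_000, 100_000]
LABELS = [
    "< 100 chars", "100-500", "500-1K", "1K-5K",
    "5K-10K", "10K-50K", "50K-100K", "> 100K",
]


def length_bucket(length):
    """Return the bucket label for a response length in characters.

    Assumption: a length equal to a boundary falls into the higher bucket.
    """
    return LABELS[bisect_right(EDGES, length)]
```

Under this convention the average response length of 14,452 characters falls in the `10K-50K` bucket.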
## Source Datasets
All 21 original TeichAI datasets merged:
```
TeichAI_brainstorm-v3.1-grok-4-fast-200x
TeichAI_claude-4.5-opus-250x
TeichAI_claude-sonnet-4.5-250x
TeichAI_deepseek-v3.2-openr1-math-3200x
TeichAI_deepseek-v3.2-speciale-1000x
TeichAI_deepseek-v3.2-speciale-OpenCodeReasoning-3k
TeichAI_gemini-2.5-flash-11000x
TeichAI_gemini-2.5-flash-lite-1000x
TeichAI_gemini-3-pro-preview-high-reasoning-1000x
TeichAI_gemini-3-pro-preview-high-reasoning-250x
TeichAI_glm-4.6-250x
TeichAI_gpt-5-codex-1000x
TeichAI_gpt-5-codex-250x
TeichAI_gpt-5.1-1000x
TeichAI_grok-code-fast-1-1000x
TeichAI_kimi-k2-thinking-1000x
TeichAI_kimi-k2-thinking-250x
TeichAI_polaris-alpha-1000x
TeichAI_sherlock-dash-alpha-1000x
TeichAI_sherlock-think-alpha-1000x
TeichAI_sherlock-thinking-alpha-11000x
```
## Notes
- **Thinking tags**: ~64% of responses contain `<think>...</think>` reasoning blocks. This is intentional for reasoning-focused models.
- **Emoji endings**: `sherlock-thinking-alpha` responses often end with emojis (🕵️‍♂️, 🚀, 🔍). This is a stylistic choice, not truncation.
- **Empty system messages**: All rows have empty system messages in the original `messages` struct (consistent across all sources).
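
When training on the final answer only, or on reasoning and answer separately, the `<think>...</think>` block has to be split off. The sketch below handles the first such block in a response. It assumes tags are well formed and not nested, and `split_reasoning` is a hypothetical helper.

```python
import re

# Non-greedy match so only the first think block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(response):
    """Return (reasoning, answer); reasoning is None if no think block exists."""
    match = THINK_RE.search(response)
    if match is None:
        return None, response
    reasoning = match.group(1).strip()
    answer = (response[:match.start()] + response[match.end():]).strip()
    return reasoning, answer
```

Responses from models listed with 0% think tags pass through unchanged, with `None` as the reasoning part.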
Provider: nlile