lordx64-claude-opus-4.7-max-cleaned
Source: ModelScope community. Updated 2026-05-01; indexed 2026-05-03.
Download link:
https://modelscope.cn/datasets/TeichAI/lordx64-claude-opus-4.7-max-cleaned
Description:
# reasoning-distill-claude-opus-4-7-max-cleaned
Cleaned version of [`lordx64/reasoning-distill-claude-opus-4-7-max`](https://huggingface.co/datasets/lordx64/reasoning-distill-claude-opus-4-7-max).
See the original dataset for full provenance, collection methodology, and terms of use.
## Cleaning steps
| Step | Filter | Reason | Rows removed |
|------|--------|--------|--------------|
| 1 | Simulated thinking (`...`) | Rows with `...` in thinking/response indicate the model learned to simulate reasoning (e.g., "Now I'm laying out the puzzle grids...") rather than actually performing it. This causes failures in agentic tasks and hallucinations during thinking. | 1,230 (15.1%) |
| 2 | Duplicate prompts | Deduplicated by exact prompt match, keeping the first occurrence. | 989 (12.2%) |
| 3 | Missing fields | Rows missing valid `thinking`, `prompt`, or `response` content (empty or null values). | 1,098 (13.5%) |
| **Total** | | | **3,317 (40.8%)** |
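The three filters above can be sketched as a staged pipeline. This is a minimal illustration, not the script actually used to build the dataset: the function name `clean_rows`, the helper `_text`, and the sample rows are hypothetical, and the simulated-thinking marker is passed in as a parameter because the table only shows it as `...`. The sketch also assumes the filters run in the order listed, which affects the per-step counts.

```python
REQUIRED_FIELDS = ("thinking", "prompt", "response")


def _text(row, key):
    """Return the field as a string, treating null or non-string values as empty."""
    value = row.get(key)
    return value if isinstance(value, str) else ""


def clean_rows(rows, marker):
    """Apply the three filters in table order; return (kept_rows, removed_counts)."""
    counts = {}

    # Step 1: drop rows whose thinking or response contains the marker.
    stage1 = [
        r for r in rows
        if marker not in _text(r, "thinking") and marker not in _text(r, "response")
    ]
    counts["simulated_thinking"] = len(rows) - len(stage1)

    # Step 2: exact-match deduplication on the prompt, keeping the first occurrence.
    seen = set()
    stage2 = []
    for r in stage1:
        prompt = _text(r, "prompt")
        if prompt in seen:
            continue
        seen.add(prompt)
        stage2.append(r)
    counts["duplicate_prompts"] = len(stage1) - len(stage2)

    # Step 3: drop rows where any required field is empty or null.
    stage3 = [r for r in stage2 if all(_text(r, k).strip() for k in REQUIRED_FIELDS)]
    counts["missing_fields"] = len(stage2) - len(stage3)

    return stage3, counts


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {"prompt": "a", "thinking": "t1", "response": "r1"},
    {"prompt": "b", "thinking": "Now I'm doing X...", "response": "r2"},
    {"prompt": "a", "thinking": "t3", "response": "r3"},
    {"prompt": "c", "thinking": None, "response": "r4"},
    {"prompt": "d", "thinking": "t5", "response": ""},
]

kept, counts = clean_rows(sample_rows, marker="...")
print(len(kept), counts)
```

Running the filters as separate stages keeps the per-step counts easy to report, as in the table; a different order would give different per-step numbers but the same set of surviving rows only if no row fails more than one filter.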

| Metric | Value |
|--------|-------|
| Original rows | 8,124 |
| Final rows | 4,807 |
| Retention rate | 59.2% |
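The figures in the two tables are mutually consistent, which a short check confirms. The counts below are copied from the tables; the variable names and the `pct` helper are only illustrative.

```python
original_rows = 8124
removed = {
    "simulated_thinking": 1230,
    "duplicate_prompts": 989,
    "missing_fields": 1098,
}


def pct(count, total=original_rows):
    """Percentage of the original rows, rounded to one decimal place."""
    return round(100 * count / total, 1)


total_removed = sum(removed.values())
final_rows = original_rows - total_removed

print(total_removed, pct(total_removed))   # rows removed and their share
print(final_rows, pct(final_rows))         # rows kept and the retention rate
```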
## Format
Each row is a JSON object with the `messages` column first, following the standard chat format:
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant", "thinking": null},
    {"role": "user", "content": "...", "thinking": null},
    {"role": "assistant", "content": "...", "thinking": "..."}
  ],
  "system": "You are a helpful assistant",
  "prompt": "...",
  "thinking": "...",
  "response": "...",
  "model": "claude-opus-4-7"
}
```
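As a rough illustration of how the flat fields relate to `messages`, the sketch below rebuilds the message list from `system`, `prompt`, `thinking`, and `response` and compares it with the stored one. It assumes, based only on the example above, that `messages` always mirrors the flat fields; the function names and the sample row are hypothetical.

```python
import json


def build_messages(system, prompt, thinking, response):
    """Assemble the three-message chat list in the layout shown in the example row."""
    return [
        {"role": "system", "content": system, "thinking": None},
        {"role": "user", "content": prompt, "thinking": None},
        {"role": "assistant", "content": response, "thinking": thinking},
    ]


def messages_match_flat_fields(row):
    """Return True if the row's `messages` equals the list rebuilt from its flat fields."""
    expected = build_messages(
        row["system"], row["prompt"], row["thinking"], row["response"]
    )
    return row["messages"] == expected


# Hypothetical row serialized as JSON, for illustration only.
raw = json.dumps({
    "messages": build_messages(
        "You are a helpful assistant", "What is 2 + 2?", "Add the numbers.", "4"
    ),
    "system": "You are a helpful assistant",
    "prompt": "What is 2 + 2?",
    "thinking": "Add the numbers.",
    "response": "4",
    "model": "claude-opus-4-7",
})

row = json.loads(raw)
print(messages_match_flat_fields(row))
```

JSON `null` becomes Python `None` on parsing, so the comparison works directly on the decoded row.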
## License
Apache-2.0 (dataset packaging). Content subject to upstream [Anthropic usage policies](https://www.anthropic.com/legal/usage-policy).
Provider:
maas
Created:
2026-04-27



