TeichAI/Claude-Opus-Dataclaw-Unredacted
Source: Hugging Face. Updated 2026-03-17; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/TeichAI/Claude-Opus-Dataclaw-Unredacted
Resource description:
---
language:
- en
license: mit
task_categories:
- text-generation
---
# Claude Opus Dataclaw Unredacted
## How this dataset was built
1. Collected the local peteromallet raw export plus selected public Dataclaw uploads.
2. Filtered to the supported Opus-family source rows.
3. Deduplicated by `session_id` and first user message.
4. Converted raw assistant `tool_uses` directly into structured OpenAI-style `tool_calls`.
5. Derived per-row tool definitions from canonical schemas and observed tool usage.
6. Preserved assistant reasoning in `<think>...</think>` blocks.
7. Removed Claude Code wrapper messages such as `<local-command-stdout>...</local-command-stdout>`.
8. Ran an AI redaction-filling pass over retained rows.
9. Validated the final JSONL structure.
This release does **not** synthesize missing tool-result messages.
The source data mostly contains user/assistant turns plus assistant tool calls, so the rebuilt output keeps that structure.
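Steps 3 and 7 above can be illustrated with a minimal sketch. The raw row shape used here (a flat `session_id` plus a `messages` list), the composite dedup key, and the rule that a wrapper message is one whose whole content is a single `<local-command-stdout>` block are all assumptions for illustration; the actual build script is not published in this card.

```python
import re

# Assumption: a "wrapper message" is one whose entire content is a single
# <local-command-stdout>...</local-command-stdout> block.
WRAPPER_RE = re.compile(r"^\s*<(local-command-stdout)>.*?</\1>\s*$", re.DOTALL)


def first_user_message(messages):
    """Return the content of the first user turn, or "" if there is none."""
    for msg in messages:
        if msg.get("role") == "user":
            return msg.get("content") or ""
    return ""


def drop_wrapper_messages(messages):
    """Remove messages that consist solely of a wrapper block (step 7)."""
    return [m for m in messages if not WRAPPER_RE.match(m.get("content") or "")]


def deduplicate(rows):
    """Keep the first row seen per (session_id, first user message) pair (step 3)."""
    seen = set()
    kept = []
    for row in rows:
        # Hypothetical raw shape: {"session_id": ..., "messages": [...]}
        key = (row["session_id"], first_user_message(row["messages"]))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```

Keeping the first occurrence preserves the input order, which makes the result reproducible when the sources are concatenated in a fixed order.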
## Output Schema
Each JSONL row contains the following top-level fields:
| Field | Description |
| --- | --- |
| `prompt` | The first user message, included for convenience |
| `messages` | Conversation turns in OpenAI chat format |
| `tools` | Tool definitions in OpenAI function-calling format |
| `original_model` | The source model label from the original dataset row |
| `source_dataset` | The local raw source identifier or Hugging Face dataset repo ID |
| `metadata` | Original metadata such as `session_id`, project, branch, timing, and stats |
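As a rough illustration of the schema, the following sketch parses one JSONL line and checks that the six documented top-level fields are present. The helper names and the file path passed to `load_rows` are hypothetical; only the field names come from the table above.

```python
import json

# Top-level fields documented in the schema table.
EXPECTED_FIELDS = {
    "prompt",
    "messages",
    "tools",
    "original_model",
    "source_dataset",
    "metadata",
}


def parse_row(line):
    """Parse one JSONL line and verify the documented top-level fields exist."""
    row = json.loads(line)
    missing = EXPECTED_FIELDS - set(row)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return row


def load_rows(path):
    """Read every non-empty line of a JSONL file (path is caller-supplied)."""
    with open(path, encoding="utf-8") as handle:
        return [parse_row(line) for line in handle if line.strip()]
```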
## Message Format
Assistant messages with tool calls store `arguments` as a native JSON object rather than a JSON-encoded string:
```json
{
  "role": "assistant",
  "content": "<think>Need to inspect the file first.</think>",
  "tool_calls": [
    {
      "type": "function",
      "id": "call_abc123",
      "function": {
        "name": "run_command",
        "arguments": {
          "command": "ls -la"
        }
      }
    }
  ]
}
```
This release contains **no `tool` role messages** because the direct rebuild does not invent missing tool outputs.
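Two small helpers may be useful when feeding these messages into other tooling. The OpenAI Chat Completions API encodes `function.arguments` as a JSON string, so pipelines that expect that shape need the object serialized first; separately, the `<think>` blocks can be split from the visible reply. Both function names are hypothetical, and this is a sketch rather than part of the dataset's own tooling.

```python
import copy
import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def stringify_tool_arguments(message):
    """Return a copy of the message with tool-call arguments JSON-encoded."""
    out = copy.deepcopy(message)  # leave the caller's message untouched
    for call in out.get("tool_calls") or []:
        args = call["function"]["arguments"]
        if not isinstance(args, str):
            call["function"]["arguments"] = json.dumps(args)
    return out


def split_reasoning(content):
    """Separate <think> blocks from the remaining visible text."""
    reasoning = [block.strip() for block in THINK_RE.findall(content)]
    visible = THINK_RE.sub("", content).strip()
    return reasoning, visible
```

Applied to the example message above, `split_reasoning` yields one reasoning block and an empty visible reply, since that message's content consists only of the `<think>` block.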
## Tool Definition Format
```json
{
  "type": "function",
  "function": {
    "name": "run_command",
    "description": "Run a bash command.",
    "parameters": {
      "type": "object",
      "properties": {
        "command": {"type": "string"}
      },
      "required": ["command"]
    }
  }
}
```
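One way to sanity-check a row is to compare each assistant tool call against the row's own `tools` list. The sketch below reports calls to undefined tools and calls that omit a parameter listed under `required`. It assumes `arguments` is a JSON object as shown in the message format; the card does not state that every call in the data passes such a check, and `check_tool_calls` is a hypothetical helper.

```python
def check_tool_calls(row):
    """Return a list of mismatches between tool calls and tool definitions."""
    # Index the row's tool definitions by function name.
    defs = {t["function"]["name"]: t["function"] for t in row.get("tools") or []}
    problems = []
    for msg in row["messages"]:
        for call in msg.get("tool_calls") or []:
            name = call["function"]["name"]
            if name not in defs:
                problems.append(f"undefined tool: {name}")
                continue
            required = defs[name].get("parameters", {}).get("required", [])
            args = call["function"]["arguments"]
            for param in required:
                if param not in args:
                    problems.append(f"{name}: missing required argument {param!r}")
    return problems
```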
## Final Stats
- **436 final rows**
- **58,077 total messages**
- **5,785 user messages**
- **52,292 assistant messages**
- **428 rows with tool definitions**
- **428 rows with assistant tool calls**
- **48,779 total tool calls**
- **0 remaining literal `[REDACTED]` occurrences**
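The counts above could be recomputed from loaded rows with something like the following sketch. The counter key names are made up for illustration, and counting `[REDACTED]` over the serialized row is an assumption about how that figure was measured.

```python
import json
from collections import Counter


def summarize(rows):
    """Tally rows, messages per role, tool calls, and literal [REDACTED] markers."""
    stats = Counter()
    for row in rows:
        stats["rows"] += 1
        if row.get("tools"):
            stats["rows_with_tool_definitions"] += 1
        has_calls = False
        for msg in row["messages"]:
            stats["messages"] += 1
            stats[f"{msg['role']}_messages"] += 1
            calls = msg.get("tool_calls") or []
            stats["tool_calls"] += len(calls)
            has_calls = has_calls or bool(calls)
        if has_calls:
            stats["rows_with_tool_calls"] += 1
        # Count the literal marker anywhere in the serialized row.
        stats["redacted"] += json.dumps(row).count("[REDACTED]")
    return stats
```

As a consistency check on the published figures, the user and assistant counts (5,785 and 52,292) sum to the stated 58,077 total messages.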
## Source Breakdown
| Source | Rows retained | Share of final dataset |
| --- | ---: | ---: |
| [`peteromallet/dataclaw-peteromallet`](https://huggingface.co/datasets/peteromallet/dataclaw-peteromallet) | 320 | 73.39% |
| [`tillg/dataclaw-tillg`](https://huggingface.co/datasets/tillg/dataclaw-tillg) | 55 | 12.61% |
| [`woctordho/dataclaw`](https://huggingface.co/datasets/woctordho/dataclaw) | 43 | 9.86% |
| [`sunsun123new/dataclaw-sunsun123new`](https://huggingface.co/datasets/sunsun123new/dataclaw-sunsun123new) | 7 | 1.61% |
| [`Batman787/dataclaw-Batman787`](https://huggingface.co/datasets/Batman787/dataclaw-Batman787) | 6 | 1.38% |
| [`parani01/dataclaw-parani01`](https://huggingface.co/datasets/parani01/dataclaw-parani01) | 4 | 0.92% |
| [`DJTRIXUK/dataclaw-DJTRIXUK`](https://huggingface.co/datasets/DJTRIXUK/dataclaw-DJTRIXUK) | 1 | 0.23% |
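The share column follows directly from the retained row counts. A short sketch that recomputes it from the figures in the table (the variable names are illustrative):

```python
# Rows retained per source, copied from the table above.
retained = {
    "peteromallet/dataclaw-peteromallet": 320,
    "tillg/dataclaw-tillg": 55,
    "woctordho/dataclaw": 43,
    "sunsun123new/dataclaw-sunsun123new": 7,
    "Batman787/dataclaw-Batman787": 6,
    "parani01/dataclaw-parani01": 4,
    "DJTRIXUK/dataclaw-DJTRIXUK": 1,
}

total = sum(retained.values())  # should equal the 436 final rows

# Percentage share of the final dataset, rounded to two decimals.
shares = {name: round(100 * count / total, 2) for name, count in retained.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```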
Provider:
TeichAI



