
TeichAI/Claude-Opus-Dataclaw-Unredacted

Hugging Face | Updated 2026-03-17 | Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/TeichAI/Claude-Opus-Dataclaw-Unredacted
Resource description:
---
language:
- en
license: mit
task_categories:
- text-generation
---

# Claude Opus Dataclaw Unredacted

## How this dataset was built

1. Collected the local peteromallet raw export plus selected public Dataclaw uploads.
2. Filtered to the supported Opus-family source rows.
3. Deduplicated by `session_id` and first user message.
4. Converted raw assistant `tool_uses` directly into structured OpenAI-style `tool_calls`.
5. Derived per-row tool definitions from canonical schemas and observed tool usage.
6. Preserved assistant reasoning in `<think>...</think>` blocks.
7. Removed Claude Code wrapper messages such as `<local-command-stdout>...</local-command-stdout>`.
8. Ran an AI redaction-filling pass over retained rows.
9. Validated the final JSONL structure.

This release does **not** synthesize missing tool-result messages. The source data mostly contains user/assistant turns plus assistant tool calls, so the rebuilt output keeps that structure.

## Output Schema

Each JSONL row contains the following top-level fields:

| Field | Description |
| --- | --- |
| `prompt` | The first user message, included for convenience |
| `messages` | Conversation turns in OpenAI chat format |
| `tools` | Tool definitions in OpenAI function-calling format |
| `original_model` | The source model label from the original dataset row |
| `source_dataset` | The local raw source identifier or Hugging Face dataset repo ID |
| `metadata` | Original metadata such as `session_id`, project, branch, timing, and stats |

## Message Format

Assistant messages with tool calls use native structured arguments, meaning `arguments` is a JSON object rather than a JSON-encoded string:

```json
{
  "role": "assistant",
  "content": "<think>Need to inspect the file first.</think>",
  "tool_calls": [
    {
      "type": "function",
      "id": "call_abc123",
      "function": {
        "name": "run_command",
        "arguments": {
          "command": "ls -la"
        }
      }
    }
  ]
}
```

This release contains **no `tool` role messages** because the direct rebuild does not invent missing tool outputs.
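As an illustration of the message format above, the following Python sketch separates `<think>...</think>` reasoning from the visible assistant text and walks the structured `tool_calls`. The helper names `split_reasoning` and `iter_tool_calls` are hypothetical and not part of the dataset or any library; the string-argument fallback is a defensive assumption, since the card states that arguments are already native objects.

```python
import json
import re

# Matches the <think>...</think> reasoning blocks described in the card.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content):
    """Return (reasoning_blocks, visible_text) for one assistant message."""
    text = content or ""
    reasoning = [block.strip() for block in THINK_RE.findall(text)]
    visible = THINK_RE.sub("", text).strip()
    return reasoning, visible


def iter_tool_calls(message):
    """Yield (tool_name, arguments) pairs from an assistant message."""
    for call in message.get("tool_calls") or []:
        function = call["function"]
        arguments = function["arguments"]
        # Assumption: tolerate JSON-encoded strings in case a row deviates
        # from the documented native-object form.
        if isinstance(arguments, str):
            arguments = json.loads(arguments)
        yield function["name"], arguments


# The example message from the card, reproduced as a Python dict.
example = {
    "role": "assistant",
    "content": "<think>Need to inspect the file first.</think>",
    "tool_calls": [
        {
            "type": "function",
            "id": "call_abc123",
            "function": {
                "name": "run_command",
                "arguments": {"command": "ls -la"},
            },
        }
    ],
}

reasoning, visible = split_reasoning(example["content"])
calls = list(iter_tool_calls(example))
print(reasoning, repr(visible), calls)
```

Because the release has no `tool` role messages, a consumer that expects each tool call to be followed by a result will need to handle the missing turn itself rather than rely on the data.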
## Tool Definition Format

```json
{
  "type": "function",
  "function": {
    "name": "run_command",
    "description": "Run a bash command.",
    "parameters": {
      "type": "object",
      "properties": {
        "command": {"type": "string"}
      },
      "required": ["command"]
    }
  }
}
```

## Final Stats

- **436 final rows**
- **58,077 total messages**
- **5,785 user messages**
- **52,292 assistant messages**
- **428 rows with tool definitions**
- **428 rows with assistant tool calls**
- **48,779 total tool calls**
- **0 remaining literal `[REDACTED]` occurrences**

## Source Breakdown

| Source | Rows retained | Share of final dataset |
| --- | ---: | ---: |
| [`peteromallet/dataclaw-peteromallet`](https://huggingface.co/datasets/peteromallet/dataclaw-peteromallet) | 320 | 73.39% |
| [`tillg/dataclaw-tillg`](https://huggingface.co/datasets/tillg/dataclaw-tillg) | 55 | 12.61% |
| [`woctordho/dataclaw`](https://huggingface.co/datasets/woctordho/dataclaw) | 43 | 9.86% |
| [`sunsun123new/dataclaw-sunsun123new`](https://huggingface.co/datasets/sunsun123new/dataclaw-sunsun123new) | 7 | 1.61% |
| [`Batman787/dataclaw-Batman787`](https://huggingface.co/datasets/Batman787/dataclaw-Batman787) | 6 | 1.38% |
| [`parani01/dataclaw-parani01`](https://huggingface.co/datasets/parani01/dataclaw-parani01) | 4 | 0.92% |
| [`DJTRIXUK/dataclaw-DJTRIXUK`](https://huggingface.co/datasets/DJTRIXUK/dataclaw-DJTRIXUK) | 1 | 0.23% |
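The figures in the Final Stats and Source Breakdown sections can be recomputed from the rows themselves. The sketch below is one way to do that, assuming rows follow the Output Schema described above. `summarize` and `source_shares` are hypothetical helper names, the two-row `sample` is invented purely to exercise the function, and only the `COUNTS` values are taken from the table.

```python
from collections import Counter


def summarize(rows):
    """Count rows, messages per role, and tool calls across dataset rows."""
    stats = Counter()
    for row in rows:
        stats["rows"] += 1
        if row.get("tools"):
            stats["rows_with_tools"] += 1
        row_has_calls = False
        for message in row.get("messages", []):
            stats["messages"] += 1
            stats[message["role"] + "_messages"] += 1
            n_calls = len(message.get("tool_calls") or [])
            stats["tool_calls"] += n_calls
            row_has_calls = row_has_calls or n_calls > 0
        if row_has_calls:
            stats["rows_with_tool_calls"] += 1
    return dict(stats)


def source_shares(counts):
    """Convert per-source row counts into percentage shares (2 decimals)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Row counts copied from the Source Breakdown table.
COUNTS = {
    "peteromallet/dataclaw-peteromallet": 320,
    "tillg/dataclaw-tillg": 55,
    "woctordho/dataclaw": 43,
    "sunsun123new/dataclaw-sunsun123new": 7,
    "Batman787/dataclaw-Batman787": 6,
    "parani01/dataclaw-parani01": 4,
    "DJTRIXUK/dataclaw-DJTRIXUK": 1,
}
shares = source_shares(COUNTS)

# Hypothetical two-row sample, not real dataset content.
call = {"type": "function", "id": "call_1",
        "function": {"name": "run_command", "arguments": {"command": "ls"}}}
sample = [
    {"tools": [{"type": "function", "function": {"name": "run_command"}}],
     "messages": [{"role": "user", "content": "List files"},
                  {"role": "assistant", "content": "", "tool_calls": [call]}]},
    {"tools": [],
     "messages": [{"role": "user", "content": "Hello"},
                  {"role": "assistant", "content": "Hi"}]},
]
sample_stats = summarize(sample)
print(shares)
print(sample_stats)
```

The table counts sum to 436, matching the stated final row count, and the stated user and assistant message counts (5,785 + 52,292) sum to the stated 58,077 total messages.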
Provider:
TeichAI