huzey/claude-skills-chunk
收藏Hugging Face2026-04-09 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/huzey/claude-skills-chunk
下载链接
链接失效反馈官方服务:
资源简介:
---
pretty_name: claude-skills-chunk
tags:
- markdown
- chunking
- retrieval
---
# huzey/claude-skills-chunk
Rule-based Markdown chunks derived from `huzey/claude-skills` (source file: `claude-skills.gpt_summary.parquet`).
## Settings
- `--include-heading-path-in-text`: true
- `--include-frontmatter-unit`: true
- `--primary-level-strategy`: `highest`
- `--max-primary-level`: `3`
- `--max-chars`: `4000`
- `--min-chars`: `800`
- `--tldr`: true (synthetic TL;DR chunk at `chunk_index_in_doc=0`)
## Columns Added From Source Dataset
These columns are copied from the source dataset `huzey/claude-skills` (per skill, then repeated for every chunk row):
- `skills_sh_id` (string): stable id, typically `<repo>/<slug>`
- `github_stars` (int)
- `skills_sh_total_installs` (int, may be null)
- `skills_sh_weekly_installs` (int, may be null)
Important: the HF dataset page metrics (downloads/likes) are not used here. The stats above come from the *data columns* in `huzey/claude-skills`.
## Export Notes
The source parquet contains exact duplicate document rows and also multiple variants for the same `(repo, name)`.
To keep identifiers stable and unique:
- `doc_uid` is a SHA1 over `(repo, name, split, domain_category, description, sha1(full_content))`.
- Exact duplicate rows (same inputs above) are skipped.
- `chunk_uid` is a SHA1 over `(doc_uid, chunk_id, chunk_index_in_doc)`.
Run summary:
- `input_rows=22862`
- `unique_docs=21017`
- `skipped_dup_docs=1845`
- `chunks=368513`
- `shards=4`
## Files
Parquet shards are stored under `data/` as `train-00000-of-000NN.parquet`.
## Columns (Per Chunk Row)
Key fields:
- `name`, `repo`, `skills_sh_id`
- `github_stars`, `skills_sh_total_installs`, `skills_sh_weekly_installs`
- `domain_category`, `split`, `description`
- `doc_uid`, `chunk_uid`, `chunk_id`, `chunk_index_in_doc`, `unit_kind`
- `heading_path_titles`, `heading_path_levels`
- `body_start_line`, `body_end_line`
- `char_len`, `text`
## TL;DR Format
For `unit_kind=tldr` rows, the chunk text is:
```text
TL;DR
<Title>:<What>
```
## Embeddings
- `qwen3emb_chunk_text` (fixed_size_list[float16], dim=4096): embedding of the per-chunk `text` field using `Qwen/Qwen3-VL-Embedding-8B`.
- instruction (system prompt): `Embed this text from a subsection of an AI agent skill.md file`
- max_length: 2048
- max_text_chars (pre-truncate, head-only): 50000
- torch dtype: fp16 (compute), stored as fp16
- `qwen3emb_chunk_text_with_tldr` (fixed_size_list[float16], dim=4096): embedding of per-chunk text augmented with the per-document TL;DR (for non-TL;DR rows).
- instruction (system prompt): `Embed this text from a subsection of an AI agent skill.md file, an overview of this skill is provided in TL;DR`
- input text: `TL;DR + "\n\n" + chunk_text` (if `unit_kind=tldr`, use `chunk_text` only)
- max_length: 2048
- max_text_chars (pre-truncate, head-only): 50000
- torch dtype: fp16 (compute), stored as fp16
- `qwen3emb_level2_chunk_text` (struct): embedding representing the level2 section that a row belongs to.
- fields: `ref_chunk_uid` (string), `emb` (list[float16], length=4096 when present; null otherwise)
- semantics:
- if a level2 section has multiple level3 chunks: concatenate those level3 `text` values and embed; `emb` is stored on the first level3 row, other rows point to it via `ref_chunk_uid`.
- otherwise: no new embedding is computed; `ref_chunk_uid` points to the row itself, meaning you should reuse its existing `qwen3emb_chunk_text`.
- `qwen3emb_level2_chunk_text_with_tldr` (struct): same as above, but using TL;DR-augmented inputs (reusing `qwen3emb_chunk_text_with_tldr` when no new level2 embedding is computed).
提供机构:
huzey



