
huzey/claude-skills-chunk

Source: Hugging Face. Updated 2026-04-09. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/huzey/claude-skills-chunk
Resource description:
---
pretty_name: claude-skills-chunk
tags:
- markdown
- chunking
- retrieval
---

# huzey/claude-skills-chunk

Rule-based Markdown chunks derived from `huzey/claude-skills` (source file: `claude-skills.gpt_summary.parquet`).

## Settings

- `--include-heading-path-in-text`: true
- `--include-frontmatter-unit`: true
- `--primary-level-strategy`: `highest`
- `--max-primary-level`: `3`
- `--max-chars`: `4000`
- `--min-chars`: `800`
- `--tldr`: true (synthetic TL;DR chunk at `chunk_index_in_doc=0`)

## Columns Added From Source Dataset

These columns are copied from the source dataset `huzey/claude-skills` (per skill, then repeated for every chunk row):

- `skills_sh_id` (string): stable id, typically `<repo>/<slug>`
- `github_stars` (int)
- `skills_sh_total_installs` (int, may be null)
- `skills_sh_weekly_installs` (int, may be null)

Important: the HF dataset page metrics (downloads/likes) are not used here. The stats above come from the *data columns* in `huzey/claude-skills`.

## Export Notes

The source parquet contains exact duplicate document rows and also multiple variants for the same `(repo, name)`. To keep identifiers stable and unique:

- `doc_uid` is a SHA1 over `(repo, name, split, domain_category, description, sha1(full_content))`.
- Exact duplicate rows (same inputs above) are skipped.
- `chunk_uid` is a SHA1 over `(doc_uid, chunk_id, chunk_index_in_doc)`.

Run summary:

- `input_rows=22862`
- `unique_docs=21017`
- `skipped_dup_docs=1845`
- `chunks=368513`
- `shards=4`

## Files

Parquet shards are stored under `data/` as `train-00000-of-000NN.parquet`.
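
The identifier scheme in the Export Notes can be sketched in a few lines of Python. This is only an illustration, not the exporter's actual code: the card names the hashed fields but not how they are serialized, so the `SEP` separator, the helper names (`sha1_hex`, `make_doc_uid`, `make_chunk_uid`, `dedupe_docs`), and the `full_content` row key are all assumptions. Hashes produced here will most likely not match the published `doc_uid` / `chunk_uid` values.

```python
import hashlib

# Assumed field separator; the dataset card does not say how the tuple is serialized.
SEP = "\x1f"


def sha1_hex(text: str) -> str:
    """Hex SHA1 digest of a UTF-8 encoded string."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def _as_text(value) -> str:
    """Render a field as text, treating None as an empty string (assumption)."""
    return "" if value is None else str(value)


def make_doc_uid(repo, name, split, domain_category, description, full_content) -> str:
    """SHA1 over (repo, name, split, domain_category, description, sha1(full_content))."""
    fields = [repo, name, split, domain_category, description]
    parts = [_as_text(f) for f in fields] + [sha1_hex(_as_text(full_content))]
    return sha1_hex(SEP.join(parts))


def make_chunk_uid(doc_uid, chunk_id, chunk_index_in_doc) -> str:
    """SHA1 over (doc_uid, chunk_id, chunk_index_in_doc)."""
    return sha1_hex(SEP.join([doc_uid, _as_text(chunk_id), _as_text(chunk_index_in_doc)]))


def dedupe_docs(rows):
    """Keep the first row per doc_uid; count exact duplicates that are skipped."""
    seen = set()
    unique, skipped = [], 0
    for row in rows:
        uid = make_doc_uid(
            row["repo"], row["name"], row["split"],
            row["domain_category"], row["description"], row["full_content"],
        )
        if uid in seen:
            skipped += 1
            continue
        seen.add(uid)
        unique.append({**row, "doc_uid": uid})
    return unique, skipped
```

Because the content hash is part of the key, two variants of the same `(repo, name)` get different `doc_uid` values and are both kept, while byte-identical rows collapse to one. That is consistent with the run summary, where `unique_docs + skipped_dup_docs = input_rows` (21017 + 1845 = 22862).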

## Columns (Per Chunk Row)

Key fields:

- `name`, `repo`, `skills_sh_id`
- `github_stars`, `skills_sh_total_installs`, `skills_sh_weekly_installs`
- `domain_category`, `split`, `description`
- `doc_uid`, `chunk_uid`, `chunk_id`, `chunk_index_in_doc`, `unit_kind`
- `heading_path_titles`, `heading_path_levels`
- `body_start_line`, `body_end_line`
- `char_len`, `text`

## TL;DR Format

For `unit_kind=tldr` rows, the chunk text is:

```text
TL;DR <Title>:<What>
```

## Embeddings

- `qwen3emb_chunk_text` (fixed_size_list[float16], dim=4096): embedding of the per-chunk `text` field using `Qwen/Qwen3-VL-Embedding-8B`.
  - instruction (system prompt): `Embed this text from a subsection of an AI agent skill.md file`
  - max_length: 2048
  - max_text_chars (pre-truncate, head-only): 50000
  - torch dtype: fp16 (compute), stored as fp16
- `qwen3emb_chunk_text_with_tldr` (fixed_size_list[float16], dim=4096): embedding of per-chunk text augmented with the per-document TL;DR (for non-TL;DR rows).
  - instruction (system prompt): `Embed this text from a subsection of an AI agent skill.md file, an overview of this skill is provided in TL;DR`
  - input text: `TL;DR + "\n\n" + chunk_text` (if `unit_kind=tldr`, use `chunk_text` only)
  - max_length: 2048
  - max_text_chars (pre-truncate, head-only): 50000
  - torch dtype: fp16 (compute), stored as fp16
- `qwen3emb_level2_chunk_text` (struct): embedding representing the level2 section that a row belongs to.
  - fields: `ref_chunk_uid` (string), `emb` (list[float16], length=4096 when present; null otherwise)
  - semantics:
    - if a level2 section has multiple level3 chunks: concatenate those level3 `text` values and embed; `emb` is stored on the first level3 row, other rows point to it via `ref_chunk_uid`.
    - otherwise: no new embedding is computed; `ref_chunk_uid` points to the row itself, meaning you should reuse its existing `qwen3emb_chunk_text`.
- `qwen3emb_level2_chunk_text_with_tldr` (struct): same as above, but using TL;DR-augmented inputs (reusing `qwen3emb_chunk_text_with_tldr` when no new level2 embedding is computed).
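
Two conventions from the Embeddings section are easy to get wrong on the consumer side: how the TL;DR-augmented input text is built, and how a level2 embedding is resolved through `ref_chunk_uid`. The sketch below works on plain Python dicts shaped like the chunk rows. The helper names (`build_with_tldr_input`, `resolve_level2_embedding`) and the `rows_by_uid` index are hypothetical, and applying the 50000-character head truncation to the combined string, rather than to the chunk text alone, is an assumption the card does not confirm.

```python
def build_with_tldr_input(row, tldr_text, max_text_chars=50000):
    """Build the input text for the *_with_tldr embedding of one chunk row.

    TL;DR rows use their own text only; other rows are prefixed with the
    per-document TL;DR text. Truncating the combined string is an assumption.
    """
    if row["unit_kind"] == "tldr":
        text = row["text"]
    else:
        text = tldr_text + "\n\n" + row["text"]
    return text[:max_text_chars]  # head-only truncation


def resolve_level2_embedding(
    row,
    rows_by_uid,
    struct_col="qwen3emb_level2_chunk_text",
    fallback_col="qwen3emb_chunk_text",
):
    """Return the level2 embedding that applies to a chunk row.

    rows_by_uid is a hypothetical dict mapping chunk_uid to row.
    """
    level2 = row[struct_col]
    if level2["emb"] is not None:
        # This row carries the section embedding itself.
        return level2["emb"]
    target = rows_by_uid[level2["ref_chunk_uid"]]
    target_emb = target[struct_col]["emb"]
    if target_emb is not None:
        # Another row of the same level2 section stores the shared embedding.
        return target_emb
    # Self-reference with no stored emb: reuse the per-chunk embedding.
    return target[fallback_col]
```

For the TL;DR-augmented variant, pass `struct_col="qwen3emb_level2_chunk_text_with_tldr"` and `fallback_col="qwen3emb_chunk_text_with_tldr"`. Real vectors have 4096 dimensions; any short list works for trying the lookup logic.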
Provider:
huzey