
FayeZC/SkillMD-138K

Hugging Face · Updated 2026-04-08 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/FayeZC/SkillMD-138K
Description:
---
license: cc-by-4.0
language:
- en
size_categories:
- 100K<n<1M
tags:
- agent-skills
- skill-md
- llm-agents
- claude-code
- copilot
- cursor
- gemini-cli
- codex
- prompt-engineering
- procedural-knowledge
- software-engineering
- code-quality
- cybersecurity
- skill-files
pretty_name: SkillMD-138K
---

# SkillMD-138K

A large public collection of Agent Skill files (`SKILL.md`) for empirical research.

## Overview

| Metric | Value |
|---|---|
| Total skills | 138,133 |
| Distinct repositories | 20,556 |
| Deduplicated | Yes (SHA-256 content hash) |

## What are Agent Skills?

Agent Skills are modular instruction files (typically named `SKILL.md`) that extend LLM agent capabilities without fine-tuning. Each skill contains YAML frontmatter (routing metadata) and a Markdown body (instructions). The format has been adopted by 30+ platforms including Claude Code, GitHub Copilot, Cursor, Gemini CLI, and OpenAI Codex.

## Collection

Skills were collected from three complementary sources:

1. **GitHub Code Search**: 40+ sharded queries across path, language, content, and star ranges
2. **Repository Cloning**: Known skill repositories and repos discovered via search
3. **Registry API**: agentskills.in registry (216,000+ indexed skills)

All files are deduplicated by SHA-256 content hash. Each record preserves the original source metadata.
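
The deduplication step described above can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' pipeline: it assumes the hash is taken over the UTF-8 bytes of the raw file text with no normalization, and that the first record seen for each hash is the one kept. The `deduplicate` helper and the sample records are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoded skill text (assumed encoding)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record seen for each distinct content hash."""
    seen = set()
    unique = []
    for record in records:
        digest = content_hash(record["content"])
        if digest in seen:
            continue  # identical content already kept from another source
        seen.add(digest)
        # Copy the record so the caller's input is not mutated.
        unique.append({**record, "content_hash": digest})
    return unique


# Hypothetical records: the first two have byte-identical content.
sample = [
    {"repo": "example/a", "content": "---\nname: demo\n---\nDo the thing.\n"},
    {"repo": "example/b", "content": "---\nname: demo\n---\nDo the thing.\n"},
    {"repo": "example/c", "content": "---\nname: other\n---\nDo something else.\n"},
]
unique = deduplicate(sample)
print(len(unique), "unique skills")
```

Exact-hash deduplication only removes byte-identical files; near-duplicates that differ by whitespace or a single word would survive under these assumptions.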
## Schema

| Column | Type | Description |
|---|---|---|
| `content_hash` | string | SHA-256 hash of the skill content (first 16 chars used as file ID) |
| `repo` | string | GitHub repository (e.g., `facebook/react`) |
| `path` | string | File path within the repository |
| `stars` | int | Repository star count at collection time |
| `source` | string | Collection method (`search`, `clone`, or `registry`) |
| `html_url` | string | GitHub URL to the original file |
| `content` | string | Full text content of the SKILL.md file |
| `lines` | int | Line count |
| `words` | int | Word count |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("FayeZC/SkillMD-138K")
print(ds["train"][0])
```

## License and Attribution

**Dataset compilation:** CC-BY-4.0. The curation, deduplication, metadata, and documentation of this dataset are licensed under CC-BY-4.0.

**Individual skill files:** Each skill file in this dataset originates from a public GitHub repository and retains the copyright and license of its original author/repository. The `repo` and `html_url` fields identify the source of each file. Users should consult the original repository's license before using individual skill contents beyond research purposes.

**Fair use and research:** This dataset is compiled for academic research under fair use principles. The inclusion of skill files is for the purpose of empirical analysis and does not imply any transfer of rights from the original authors.

**Attribution:** If you use individual skills from this dataset, please attribute the original repository and author.

**Removal requests:** If you are the author of a skill file included in this dataset and wish to have it removed, please open an issue on this repository or contact us directly.
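
As a rough illustration of working with one record, the sketch below splits the `content` field into its YAML frontmatter and Markdown body and builds an attribution line from the `repo` and `html_url` fields. The helpers `split_frontmatter` and `attribution` and the sample record are hypothetical, and the split assumes the frontmatter is delimited by `---` lines at the very top of the file; real skill files may not all follow that layout.

```python
def split_frontmatter(content: str):
    """Return (frontmatter_text, body); frontmatter is '' when absent."""
    lines = content.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", content
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    # No closing delimiter found: treat the whole file as body.
    return "", content


def attribution(record: dict) -> str:
    """One-line credit built from the dataset's `repo` and `html_url` columns."""
    return f"Source: {record['repo']} ({record['html_url']})"


# Hypothetical record shaped like a row of the dataset.
record = {
    "repo": "example/skills",
    "html_url": "https://github.com/example/skills/blob/main/demo/SKILL.md",
    "content": "---\nname: demo\ndescription: A demo skill\n---\n# Demo\n\nFollow these steps.\n",
}
frontmatter, body = split_frontmatter(record["content"])
print(frontmatter)
print(attribution(record))
```

The frontmatter is returned as raw text here to keep the sketch dependency-free; a YAML parser such as PyYAML's `yaml.safe_load` could be applied to it if structured fields are needed.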
Provider:
FayeZC