
nisten/deepwiki-public-repo-reviews

Source: Hugging Face · Updated 2026-04-15 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/nisten/deepwiki-public-repo-reviews
Resource description:
---
license: mit
language:
- en
tags:
- code
- documentation
- technical
- mermaid
- github
- repositories
- deepwiki
- sharegpt
- instruction-tuning
task_categories:
- text-generation
- question-answering
pretty_name: DeepWiki Repository Summaries (ShareGPT)
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: id
    dtype: string
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: source
    dtype: string
  - name: repo
    dtype: string
  - name: pages
    dtype: int32
  - name: bytes
    dtype: int32
  - name: fetched_at
    dtype: int64
  - name: fetched_date
    dtype: string
  splits:
  - name: train
    num_examples: 6920
---

# DeepWiki Repository Summaries

A single-turn instruction dataset of full technical repository summaries, sourced from [DeepWiki](https://deepwiki.com), an AI-generated wiki platform for GitHub repositories.

Each record is one question → one long-form answer covering the architecture, components, data flows, APIs, and implementation details of a GitHub repository.

## Archive Status - LAST UPDATED APRIL 15 2026

| Stat | Value |
|------|-------|
| **Repos archived (done)** | 6,920 |
| **Repos seeded (total probed)** | 7,852 |
| **Missing (not indexed on DeepWiki)** | 921 |
| **Total content (uncompressed)** | ~2.3 GB |
| **Parquet file size (zstd)** | ~635 MB |
| **Last refreshed** | 2026-04-15 |
| **Source API** | deepwiki.com public MCP (`read_wiki_contents`) |

## Format

ShareGPT / ChatML single-turn format, **not multi-turn**.
```json
{
  "id": "deepwiki/facebook/react",
  "conversations": [
    { "from": "human", "value": "Give me a full technical summary of this repo" },
    { "from": "gpt", "value": "# facebook/react\n\n# Overview\n\n...(full wiki)...\n\n---\n*Source: deepwiki.com/facebook/react | Archived: 2026-04-15 | License: MIT*" }
  ],
  "source": "deepwiki",
  "repo": "facebook/react",
  "pages": 12,
  "bytes": 48291,
  "fetched_at": 1744900000,
  "fetched_date": "2026-04-15"
}
```

## Fields

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique ID: `deepwiki/<owner>/<repo>` |
| `conversations` | list | `[{"from":"human","value":"..."}, {"from":"gpt","value":"..."}]` |
| `source` | string | Always `"deepwiki"` |
| `repo` | string | GitHub slug (`<owner>/<repo>`) |
| `pages` | int32 | Number of wiki pages in the original DeepWiki article |
| `bytes` | int32 | UTF-8 byte size of the assistant response (content only, excl. footer) |
| `fetched_at` | int64 | Unix timestamp when this wiki was archived |
| `fetched_date` | string | ISO date of archival (`YYYY-MM-DD`) |

## Parquet schema

The `.parquet` file splits conversations into two columns for easier tooling:

| Column | Type |
|--------|------|
| `id` | string |
| `repo` | string |
| `human` | large_string |
| `gpt` | large_string |
| `pages` | int32 |
| `bytes` | int32 |
| `fetched_at` | int64 |
| `fetched_date` | string |

## Content details

**The `gpt` response contains:**

1. Full multi-page DeepWiki wiki, with all pages joined by `---` separators
2. Footer: `*Source: deepwiki.com/<repo> | Archived: YYYY-MM-DD | License: MIT*`

**Total approx tokens:** 541,731,172

**Min:** 1,269 tokens (`tezos/tezos`)

**Max:** 581,970 tokens (`iflytek/astron-agent`)

> Token count approximation: `len(gpt_response) // 4` (chars ÷ 4, standard English/code heuristic)

## Distribution by approximate token length

| Bucket | Count | % of total | Avg tokens | Cumulative % |
|--------|------:|----------:|----------:|-------------:|
| < 2048 | 2 | 0.0% | 1,538 | 0.0% |
| < 8192 | 11 | 0.2% | 6,080 | 0.2% |
| < 10k | 10 | 0.1% | 9,278 | 0.3% |
| < 20k | 129 | 1.9% | 16,215 | 2.2% |
| < 50k | 2,436 | 35.2% | 37,576 | 37.4% |
| < 100k | 2,857 | 41.3% | 68,086 | 78.7% |
| < 150k | 711 | 10.3% | 122,716 | 89.0% |
| < 200k | 401 | 5.8% | 173,130 | 94.8% |
| < 250k | 186 | 2.7% | 222,502 | 97.4% |
| < 300k | 96 | 1.4% | 272,760 | 98.8% |
| < 350k | 47 | 0.7% | 321,531 | 99.5% |
| < 400k | 21 | 0.3% | 370,791 | 99.8% |
| < 450k | 6 | 0.1% | 425,224 | 99.9% |
| < 500k | 2 | 0.0% | 482,068 | 99.9% |
| ≥ 500k | 5 | 0.1% | 551,941 | 100.0% |

## Cumulative coverage (how many repos fit in common context windows)

| Context window | Max tokens | Records that fit | % of dataset |
|----------------|----------:|----------------:|-------------:|
| 8K ctx | 8,192 | 13 | 0.2% |
| 32K ctx | 32,768 | 152 | 2.2% |
| 64K ctx | 64,000 | 2,588 | 37.4% |
| 128K ctx | 131,072 | 5,445 | 78.7% |
| 200K ctx | 200,000 | 6,557 | 94.8% |
| 500K ctx | 500,000 | 6,920 | 100.0% |

## Notes

- Token counts are approximate (`len(text) // 4`). Actual BPE counts may vary ±20%.
- `gpt` field includes the footer: `*Source: deepwiki.com/... | Archived: ...*`
- Boilerplate ("The following files were used as context...") stripped at export time.
- Dataset generated from archive at `/mnt/deepwiki-vol/` on 2026-04-15.

**Cleaned:**

- "The following files were used as context..." boilerplate stripped from every page
- Triple blank lines collapsed
- Duplicate `---` separators removed

**Mermaid diagrams preserved**: diagram source (```mermaid ... ```) is kept verbatim. Common types: `graph TD/LR`, `sequenceDiagram`, `classDiagram`, `erDiagram`, `stateDiagram-v2`, `mindmap`

## Repository coverage

| Source | Count |
|--------|-------|
| GitHub repos ≥ 1,000 stars | ~7,000 seeded |
| lucidrains (ML researcher, 100+ repos) | ~80 archived |
| AlpinDale (LLM inference forks) | ~50 archived |
| nisten org | ~40 archived |
| alpinelinux org | ~20 archived |
| lantos1618 | ~15 archived |
| Curated list (DeepWiki homepage + hand-picked) | ~200 seeded |

Repos not yet indexed on DeepWiki at archival time are excluded (marked missing).

## Token length distribution

See `stats.md` in this directory for full token-length histogram.

Approximate context window coverage:

- **< 2K tokens**: small repos / overview-only wikis
- **2K–32K tokens**: typical repos (most of the dataset)
- **32K–128K tokens**: large repos (Linux kernel, PyTorch, VSCode, etc.)
- **> 128K tokens**: very large monorepos (rare)

## Usage

```python
# HuggingFace datasets
# "your-username/deepwiki-sharegpt" is the card's placeholder ID; this page
# lists the dataset as nisten/deepwiki-public-repo-reviews.
from datasets import load_dataset
ds = load_dataset("your-username/deepwiki-sharegpt")

# Local parquet (flat `human` / `gpt` columns, see "Parquet schema")
import pandas as pd
df = pd.read_parquet("deepwiki_sharegpt.parquet")
print(df[df.repo == "facebook/react"].iloc[0]["gpt"][:1000])

# Local JSONL (one ShareGPT record per line)
import json
with open("deepwiki_sharegpt.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

# Filter by size (bytes / 4 approximates tokens)
small = df[df.bytes < 20_000]   # < ~5K tokens
large = df[df.bytes > 100_000]  # > ~25K tokens
```

```python
# Fine-tuning with TRL / transformers
from trl import SFTTrainer
from datasets import load_dataset

ds = load_dataset("your-username/deepwiki-sharegpt")

def format_record(ex):
    # Flatten the single-turn conversation into one `text` field.
    h = ex["conversations"][0]["value"]
    g = ex["conversations"][1]["value"]
    return {"text": f"<|user|>\n{h}\n<|assistant|>\n{g}"}

ds = ds.map(format_record)
```

## License

**MIT**: you are free to use, modify, and redistribute with attribution.

- Dataset compilation: MIT
- Original wiki content: generated by DeepWiki from public GitHub repositories
- Each referenced GitHub repository retains its own license

## Citation

```bibtex
@dataset{deepwiki_sharegpt_2026,
  title   = {DeepWiki Repository Summaries (ShareGPT)},
  year    = {2026},
  month   = {4},
  url     = {https://huggingface.co/datasets/your-username/deepwiki-sharegpt},
  license = {mit},
  note    = {6,920 GitHub repository wikis archived from deepwiki.com via MCP API, April 2026}
}
```
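## Example: token estimate, footer handling, and record shape

The card describes a `len(text) // 4` token heuristic, a fixed footer appended to each `gpt` value, and a parquet layout that flattens `conversations` into `human` and `gpt` columns. The sketch below ties these together using only the standard library. The helper names (`approx_tokens`, `smallest_window`, `strip_footer`, `to_sharegpt`) are hypothetical and not part of the dataset or any library. Two assumptions are made: a record "fits" a window when its estimate is less than or equal to the limit, and the footer is optionally preceded by a `---` separator, as in the example record.

```python
import re

# Context-window limits copied from the card's coverage table.
CONTEXT_WINDOWS = {
    "8K": 8_192,
    "32K": 32_768,
    "64K": 64_000,
    "128K": 131_072,
    "200K": 200_000,
    "500K": 500_000,
}

# Footer shape per the card:
# *Source: deepwiki.com/<repo> | Archived: YYYY-MM-DD | License: MIT*
# Assumption: an optional `---` separator line may precede it.
FOOTER_RE = re.compile(
    r"\n*(?:---\n+)?\*Source: deepwiki\.com/\S+ \| "
    r"Archived: \d{4}-\d{2}-\d{2} \| License: MIT\*\s*$"
)


def approx_tokens(text: str) -> int:
    """Card's heuristic: character count divided by 4, rounded down."""
    return len(text) // 4


def smallest_window(text: str):
    """Label of the smallest listed window whose limit covers the estimate.

    Returns None when the estimate exceeds every listed window.
    Assumption: "fits" means estimate <= limit.
    """
    n = approx_tokens(text)
    for label, limit in CONTEXT_WINDOWS.items():  # dict order is ascending here
        if n <= limit:
            return label
    return None


def strip_footer(gpt_text: str) -> str:
    """Remove the trailing Source/Archived/License footer, if present."""
    return FOOTER_RE.sub("", gpt_text)


def to_sharegpt(row: dict) -> dict:
    """Rebuild the nested `conversations` list from flat parquet-style columns."""
    return {
        "id": row["id"],
        "conversations": [
            {"from": "human", "value": row["human"]},
            {"from": "gpt", "value": row["gpt"]},
        ],
        "source": "deepwiki",
        "repo": row["repo"],
    }
```

Because the `bytes` field is documented as excluding the footer, calling `strip_footer` before measuring a response may give a figure closer to that field. Actual tokenizer counts can differ from `approx_tokens` (the card cites ±20%), so leave headroom when selecting records for a given context window.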
Provider:
nisten