nisten/deepwiki-public-repo-reviews
Hugging Face, updated 2026-04-15, indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/nisten/deepwiki-public-repo-reviews
Resource description:
---
license: mit
language:
- en
tags:
- code
- documentation
- technical
- mermaid
- github
- repositories
- deepwiki
- sharegpt
- instruction-tuning
task_categories:
- text-generation
- question-answering
pretty_name: DeepWiki Repository Summaries (ShareGPT)
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: id
    dtype: string
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: source
    dtype: string
  - name: repo
    dtype: string
  - name: pages
    dtype: int32
  - name: bytes
    dtype: int32
  - name: fetched_at
    dtype: int64
  - name: fetched_date
    dtype: string
  splits:
  - name: train
    num_examples: 6920
---
# DeepWiki Repository Summaries
A single-turn instruction dataset of full technical repository summaries, sourced from
[DeepWiki](https://deepwiki.com) — an AI-generated wiki platform for GitHub repositories.
Each record is one question → one long-form answer covering the architecture, components,
data flows, APIs, and implementation details of a GitHub repository.
## Archive Status - LAST UPDATED APRIL 15 2026
| Stat | Value |
|------|-------|
| **Repos archived (done)** | 6,920 |
| **Repos seeded (total probed)** | 7,852 |
| **Missing (not indexed on DeepWiki)** | 921 |
| **Total content (uncompressed)** | ~2.3 GB |
| **Parquet file size (zstd)** | ~635 MB |
| **Last refreshed** | 2026-04-15 |
| **Source API** | deepwiki.com public MCP (`read_wiki_contents`) |
## Format
ShareGPT / ChatML single-turn format — **not multi-turn**.
```json
{
  "id": "deepwiki/facebook/react",
  "conversations": [
    {
      "from": "human",
      "value": "Give me a full technical summary of this repo"
    },
    {
      "from": "gpt",
      "value": "# facebook/react\n\n# Overview\n\n...(full wiki)...\n\n---\n*Source: deepwiki.com/facebook/react | Archived: 2026-04-15 | License: MIT*"
    }
  ],
  "source": "deepwiki",
  "repo": "facebook/react",
  "pages": 12,
  "bytes": 48291,
  "fetched_at": 1744900000,
  "fetched_date": "2026-04-15"
}
```
## Fields
| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique ID — `deepwiki/<owner>/<repo>` |
| `conversations` | list | `[{"from":"human","value":"..."}, {"from":"gpt","value":"..."}]` |
| `source` | string | Always `"deepwiki"` |
| `repo` | string | GitHub slug (`<owner>/<repo>`) |
| `pages` | int32 | Number of wiki pages in the original DeepWiki article |
| `bytes` | int32 | UTF-8 byte size of the assistant response (content only, excl. footer) |
| `fetched_at` | int64 | Unix timestamp when this wiki was archived |
| `fetched_date` | string | ISO date of archival (`YYYY-MM-DD`) |
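
The field descriptions above translate into a few cheap consistency checks. The following is a minimal sketch, assuming records are plain Python dicts as in the JSON example; the function name `validate_record` and the wording of the messages are illustrative and not part of the dataset tooling.

```python
import re

# Field names come from the Fields table; the checks themselves are illustrative.
REQUIRED_FIELDS = ("id", "conversations", "source", "repo",
                   "pages", "bytes", "fetched_at", "fetched_date")


def validate_record(rec: dict) -> list[str]:
    """Return a list of problems found in one record (empty if it looks fine)."""
    problems = [f"missing field: {k}" for k in REQUIRED_FIELDS if k not in rec]
    if problems:
        return problems
    # `id` is documented as deepwiki/<owner>/<repo>, and `repo` as <owner>/<repo>.
    if rec["id"] != f"deepwiki/{rec['repo']}":
        problems.append("id does not match deepwiki/<owner>/<repo>")
    if rec["source"] != "deepwiki":
        problems.append("source is not 'deepwiki'")
    # Single-turn format: exactly one human message followed by one gpt message.
    roles = [turn.get("from") for turn in rec["conversations"]]
    if roles != ["human", "gpt"]:
        problems.append("conversations is not a single human -> gpt turn")
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", rec["fetched_date"]):
        problems.append("fetched_date is not YYYY-MM-DD")
    return problems
```

An empty list means the record matches the documented shape; anything else names the first mismatches found.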
## Parquet schema
The `.parquet` file splits conversations into two columns for easier tooling:
| Column | Type |
|--------|------|
| `id` | string |
| `repo` | string |
| `human` | large_string |
| `gpt` | large_string |
| `pages` | int32 |
| `bytes` | int32 |
| `fetched_at` | int64 |
| `fetched_date` | string |
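
Because the Parquet layout is flat, tools that expect the ShareGPT shape need the `conversations` list rebuilt. A minimal sketch, assuming each row is available as a dict keyed by the column names above; `row_to_sharegpt` is a hypothetical helper name.

```python
def row_to_sharegpt(row: dict) -> dict:
    """Rebuild the ShareGPT-style record from one flat Parquet row (as a dict)."""
    return {
        "id": row["id"],
        "conversations": [
            {"from": "human", "value": row["human"]},
            {"from": "gpt", "value": row["gpt"]},
        ],
        # Not a Parquet column; the Fields table says it is always "deepwiki".
        "source": "deepwiki",
        "repo": row["repo"],
        "pages": row["pages"],
        "bytes": row["bytes"],
        "fetched_at": row["fetched_at"],
        "fetched_date": row["fetched_date"],
    }
```

With pandas, `df.to_dict("records")` yields one such dict per row.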
## Content details
**The `gpt` response contains:**
1. Full multi-page DeepWiki wiki — all pages joined with `---` separators
2. Footer: `*Source: deepwiki.com/<repo> | Archived: YYYY-MM-DD | License: MIT*`
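
The footer can be separated from the wiki body with a regular expression built from the documented footer pattern. This is a sketch under assumptions: the footer is the last line of the response, and pages are separated by lines consisting only of `---`. The page split is a heuristic, since wiki content may itself contain such lines. `split_footer` and `split_pages` are hypothetical names.

```python
import re

# Pattern follows the documented footer:
# *Source: deepwiki.com/<repo> | Archived: YYYY-MM-DD | License: MIT*
FOOTER_RE = re.compile(
    r"\n*(?:---\n)?\*Source: deepwiki\.com/(?P<repo>\S+) \| "
    r"Archived: (?P<date>\d{4}-\d{2}-\d{2}) \| License: (?P<license>[^*]+)\*\s*$"
)


def split_footer(gpt_text: str):
    """Return (body, footer_info); footer_info is None when no footer is found."""
    m = FOOTER_RE.search(gpt_text)
    if m is None:
        return gpt_text, None
    return gpt_text[: m.start()], m.groupdict()


def split_pages(body: str) -> list[str]:
    """Heuristic page split on lines that contain only `---`."""
    parts = re.split(r"(?m)^---[ \t]*$", body)
    return [p.strip() for p in parts if p.strip()]
```
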
- **Total approx tokens:** 541,731,172
- **Min:** 1,269 tokens (`tezos/tezos`)
- **Max:** 581,970 tokens (`iflytek/astron-agent`)
> Token count approximation: `len(gpt_response) // 4` (chars ÷ 4, standard English/code heuristic)
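
The heuristic is a one-liner. The sketch below also applies it to the `bytes` field, which is how the size filters in the Usage section estimate tokens. Both function names are illustrative.

```python
def approx_tokens(text: str) -> int:
    """Card heuristic: characters divided by 4, rounded down."""
    return len(text) // 4


def approx_tokens_from_bytes(n_bytes: int) -> int:
    """Same heuristic applied to the `bytes` field (UTF-8 size).

    Non-ASCII characters take more than one byte, so for the same text this
    can only match or exceed approx_tokens().
    """
    return n_bytes // 4
```

Note that `bytes` excludes the footer while the `gpt` text includes it, so the two estimates differ slightly even for pure-ASCII content.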
## Distribution by approximate token length
| Bucket | Count | % of total | Avg tokens | Cumulative % |
|--------|------:|----------:|----------:|-------------:|
| < 2048 | 2 | 0.0% | 1,538 | 0.0% |
| < 8192 | 11 | 0.2% | 6,080 | 0.2% |
| < 10k | 10 | 0.1% | 9,278 | 0.3% |
| < 20k | 129 | 1.9% | 16,215 | 2.2% |
| < 50k | 2,436 | 35.2% | 37,576 | 37.4% |
| < 100k | 2,857 | 41.3% | 68,086 | 78.7% |
| < 150k | 711 | 10.3% | 122,716 | 89.0% |
| < 200k | 401 | 5.8% | 173,130 | 94.8% |
| < 250k | 186 | 2.7% | 222,502 | 97.4% |
| < 300k | 96 | 1.4% | 272,760 | 98.8% |
| < 350k | 47 | 0.7% | 321,531 | 99.5% |
| < 400k | 21 | 0.3% | 370,791 | 99.8% |
| < 450k | 6 | 0.1% | 425,224 | 99.9% |
| < 500k | 2 | 0.0% | 482,068 | 99.9% |
| ≥ 500k | 5 | 0.1% | 551,941 | 100.0% |
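
A histogram like the one above can be rebuilt from per-record token estimates. The sketch below assumes each bucket holds records at or above the previous edge and strictly below its own edge, which is consistent with the counts summing to 6,920. `EDGES`, `LABELS`, and the function names are illustrative.

```python
from bisect import bisect_right
from collections import Counter

# Upper edges and labels copied from the distribution table.
EDGES = [2048, 8192, 10_000, 20_000, 50_000, 100_000, 150_000, 200_000,
         250_000, 300_000, 350_000, 400_000, 450_000, 500_000]
LABELS = ["< 2048", "< 8192", "< 10k", "< 20k", "< 50k", "< 100k", "< 150k",
          "< 200k", "< 250k", "< 300k", "< 350k", "< 400k", "< 450k",
          "< 500k", "≥ 500k"]


def bucket_label(tokens: int) -> str:
    """Label of the first bucket whose upper edge is strictly above `tokens`."""
    return LABELS[bisect_right(EDGES, tokens)]


def histogram(token_counts) -> Counter:
    """Count records per bucket."""
    return Counter(bucket_label(t) for t in token_counts)
```
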
## Cumulative coverage (how many repos fit in common context windows)
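| Context window | Max tokens | Records that fit | % of dataset |
|----------------|----------:|----------------:|-------------:|
| 8K ctx | 8,192 | 13 | 0.2% |
| 32K ctx | 32,768 | ≥ 152 | ≥ 2.2% |
| 64K ctx | 64,000 | ≥ 2,588 | ≥ 37.4% |
| 128K ctx | 131,072 | ≥ 5,445 | ≥ 78.7% |
| 200K ctx | 200,000 | 6,557 | 94.8% |
| 500K ctx | 500,000 | 6,915 | 99.9% |
> Counts are read off the bucket table above. The 32K, 64K, and 128K rows use the nearest bucket edge below the window (20k, 50k, and 100k tokens), so they are lower bounds. The 5 records of 500k tokens or more (up to 581,970) do not fit in a 500K window.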
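
Exact coverage for any window can be computed directly from per-record token estimates instead of bucket edges. A minimal sketch: the `reserve` parameter is a hypothetical addition for leaving room for the prompt and generated output, and a record is counted as fitting when its estimate is at most the remaining budget.

```python
def records_that_fit(token_counts, window: int, reserve: int = 0):
    """Return (count, percent) of records whose token estimate fits the window.

    `reserve` is subtracted from the window first (hypothetical knob for
    prompt and output headroom).
    """
    counts = list(token_counts)
    budget = window - reserve
    n_fit = sum(1 for t in counts if t <= budget)
    pct = 100.0 * n_fit / len(counts) if counts else 0.0
    return n_fit, pct
```

For example, calling it with the `len(gpt) // 4` estimates and `window=131_072` gives the exact 128K figure rather than a lower bound.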
## Notes
- Token counts are approximate (`len(text) // 4`). Actual BPE counts may vary ±20%.
- `gpt` field includes the footer: `*Source: deepwiki.com/... | Archived: ...*`
- Boilerplate ("The following files were used as context...") stripped at export time.
- Dataset generated from archive at `/mnt/deepwiki-vol/` on 2026-04-15.
**Cleaned:**
- "The following files were used as context..." boilerplate stripped from every page
- Triple blank lines collapsed
- Duplicate `---` separators removed
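
The three cleaning steps could look roughly like the sketch below. The actual export script is not shown in this card, so every detail here is an assumption: the boilerplate is treated as a single line starting with the quoted phrase, "triple blank lines" is read as three or more consecutive blank lines reduced to one, and duplicate separators are `---` lines with only whitespace between them.

```python
import re

# Assumed prefix of the boilerplate line (the card quotes only its beginning).
BOILERPLATE_PREFIX = "The following files were used as context"


def clean_page(text: str) -> str:
    """Illustrative re-implementation of the cleaning steps listed above."""
    # 1. Drop lines that start with the boilerplate phrase.
    lines = [ln for ln in text.splitlines()
             if not ln.lstrip().startswith(BOILERPLATE_PREFIX)]
    text = "\n".join(lines)
    # 2. Collapse runs of `---` separator lines (blank lines between) into one.
    text = re.sub(r"(?m)^---[ \t]*(?:(?:\n[ \t]*)+---[ \t]*)+$", "---", text)
    # 3. Collapse three or more blank lines into a single blank line.
    text = re.sub(r"\n{4,}", "\n\n", text)
    return text
```
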
**Mermaid diagrams preserved** — diagram source (```mermaid ... ```) is kept verbatim.
Common types: `graph TD/LR`, `sequenceDiagram`, `classDiagram`, `erDiagram`, `stateDiagram-v2`, `mindmap`
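
Since the diagram source is kept verbatim, it can be pulled back out of a `gpt` response with a regular expression. A minimal sketch, assuming each diagram sits in a fenced block whose opening fence is three backticks followed by `mermaid`; the fence string is built programmatically only to keep this example readable inside its own code fence, and the function names are illustrative.

```python
import re

FENCE = "`" * 3  # three backticks
MERMAID_RE = re.compile(FENCE + r"mermaid[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_mermaid(markdown: str) -> list[str]:
    """Return the source of every mermaid block, without the fences."""
    return [src.strip() for src in MERMAID_RE.findall(markdown)]


def diagram_type(source: str) -> str:
    """First word of the first non-empty line, e.g. 'graph' or 'sequenceDiagram'."""
    for line in source.splitlines():
        if line.strip():
            return line.split()[0]
    return ""
```
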
## Repository coverage
| Source | Count |
|--------|-------|
| GitHub repos ≥ 1,000 stars | ~7,000 seeded |
| lucidrains (ML researcher, 100+ repos) | ~80 archived |
| AlpinDale (LLM inference forks) | ~50 archived |
| nisten org | ~40 archived |
| alpinelinux org | ~20 archived |
| lantos1618 | ~15 archived |
| Curated list (DeepWiki homepage + hand-picked) | ~200 seeded |
Repos not yet indexed on DeepWiki at archival time are excluded (marked missing).
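
Per-owner counts like the ones in the table can be checked against the data by splitting the `repo` slug. A small sketch, assuming every slug has the documented `<owner>/<repo>` form; `owner_counts` is a hypothetical helper.

```python
from collections import Counter


def owner_counts(repos) -> Counter:
    """Count records per GitHub owner, taken from the part before the first '/'."""
    return Counter(slug.split("/", 1)[0] for slug in repos)
```

With the Parquet file loaded into pandas, `owner_counts(df["repo"]).most_common(10)` lists the ten most represented owners.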
## Token length distribution
See `stats.md` in this directory for full token-length histogram.
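Approximate size bands (shares taken from the bucket table above):
- **< 2K tokens**: small repos / overview-only wikis (2 records)
- **2K–32K tokens**: smaller repos (uncommon; only about 2% of records fall under 20k tokens)
- **32K–128K tokens**: the bulk of the dataset, including large repos (Linux kernel, PyTorch, VSCode, etc.); about 76% of records fall between 20k and 100k tokens
- **> 128K tokens**: very large repos and monorepos (about 21% of records exceed 100k tokens)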
## Usage
```python
# Hugging Face datasets (repo ID as listed at the top of this card)
from datasets import load_dataset
ds = load_dataset("nisten/deepwiki-public-repo-reviews")

# Local parquet (columns: id, repo, human, gpt, ...)
import pandas as pd
df = pd.read_parquet("deepwiki_sharegpt.parquet")
print(df[df.repo == "facebook/react"].iloc[0]["gpt"][:1000])

# Local JSONL (one record per line)
import json
with open("deepwiki_sharegpt.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

# Filter by size (`bytes` is the UTF-8 size of the answer; ~4 bytes per token)
small = df[df.bytes < 20_000]   # < ~5K tokens
large = df[df.bytes > 100_000]  # > ~25K tokens
```
```python
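# Fine-tuning with TRL / transformers
from trl import SFTTrainer
from datasets import load_dataset

ds = load_dataset("nisten/deepwiki-public-repo-reviews")

def format_record(ex):
    # Flatten the single human -> gpt turn into one training string.
    h = ex["conversations"][0]["value"]
    g = ex["conversations"][1]["value"]
    # Placeholder chat tags; swap in your model's chat template.
    return {"text": f"<|user|>\n{h}\n<|assistant|>\n{g}"}

ds = ds.map(format_record)
# Pass ds["train"] to SFTTrainer together with your model and config.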
```
## License
**MIT** — you are free to use, modify, and redistribute with attribution.
- Dataset compilation: MIT
- Original wiki content: generated by DeepWiki from public GitHub repositories
- Each referenced GitHub repository retains its own license
## Citation
```bibtex
@dataset{deepwiki_sharegpt_2026,
  title   = {DeepWiki Repository Summaries (ShareGPT)},
  year    = {2026},
  month   = {4},
  url     = {https://huggingface.co/datasets/nisten/deepwiki-public-repo-reviews},
  license = {mit},
  note    = {6,920 GitHub repository wikis archived from deepwiki.com via MCP API, April 2026}
}
```
Provider: nisten



