
nuhmanpk/dev-knowledge-base

Hugging Face · Updated 2026-03-23 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/nuhmanpk/dev-knowledge-base
Resource description:
---
license: mit
dataset_info:
  features:
  - name: title
    dtype: string
  - name: source
    dtype: string
  - name: url
    dtype: string
  - name: category
    dtype: string
  - name: language
    dtype: string
  - name: content
    dtype: string
  - name: chunk_id
    dtype: int64
  - name: chunk_length
    dtype: int64
  - name: last_updated
    dtype: string
  splits:
  - name: train
    num_bytes: 401051216
    num_examples: 426107
  - name: test
    num_bytes: 941198
    num_examples: 1000
  download_size: 180107389
  dataset_size: 401992414
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
task_categories:
- question-answering
- summarization
- text-generation
language:
- en
tags:
- code
pretty_name: 'DevBase '
size_categories:
- 100K<n<1M
---

# Dev Knowledge Base (Programming Documentation Dataset)

A large-scale, structured dataset of programming documentation collected from official sources across languages, frameworks, tools, and AI ecosystems.

Follow me on GitHub: https://github.com/nuhmanpk

---

## Overview

This dataset contains cleaned and structured documentation content scraped from official developer docs across multiple domains, such as:

* Programming languages
* Frameworks (frontend, backend)
* DevOps & infrastructure tools
* Databases
* Machine learning & AI libraries

All content is split into chunks of roughly 800 characters and is intended for:

* Retrieval-Augmented Generation (RAG)
* Developer copilots
* Code assistants
* Semantic search

---

## Dataset Structure

Each row represents one chunk of documentation.

| Column       | Description                                |
| ------------ | ------------------------------------------ |
| title        | Page title or endpoint                     |
| source       | Source name (e.g., react, python, fastapi) |
| url          | Original documentation URL                 |
| category     | Type (language, framework, database, etc.) |
| language     | Programming language                       |
| content      | Cleaned text chunk                         |
| chunk_id     | Chunk index within page                    |
| chunk_length | Character length                           |
| last_updated | Timestamp                                  |

---

## Sources Included

### Languages

python, javascript, typescript, go, rust, java, csharp, dart, swift, kotlin

### Frontend & Frameworks

react, nextjs, vue, nuxt, svelte, sveltekit, angular, astro, qwik, solidjs

### Backend & APIs

fastapi, django, flask, express, nestjs, hono, elysia

### Runtime & Tooling

nodejs, deno, bun, vite, webpack, turborepo, nx, pnpm, biome

### UI Libraries

tailwind, shadcn_ui, chakra_ui, mui

### Mobile & Desktop

react_native, expo, flutter, tauri, electron

### Machine Learning & AI

numpy, pandas, pytorch, tensorflow, scikit_learn, xgboost, lightgbm, transformers, langchain, llamaindex, openai, vllm, ollama, haystack, mastra, pydantic_ai, langfuse, mcp

### Databases

postgresql, mysql, sqlite, mongodb, redis, supabase, firebase, planetscale, neon, convex, drizzle_orm, qdrant, turso

### DevOps & Infrastructure

docker, kubernetes, terraform, ansible, github_actions, gitlab_ci, git, opentelemetry, inngest, temporal

### Other

claude_agent_sdk

Full crawl configuration available here:

---

## Chunk Distribution

Example distribution after cleaning and removing Zig:

| Source       | Chunks   |
| ------------ | -------- |
| python       | ~15,000  |
| javascript   | ~4,000   |
| go           | ~8,000   |
| react        | ~3,000   |
| nextjs       | ~4,000   |
| docker       | ~4,000   |
| kubernetes   | ~14,000  |
| transformers | ~14,000  |
| firebase     | ~300,000 |
| redis        | ~17,000  |
| git          | ~14,000  |
| flutter      | ~14,000  |
| supabase     | ~10,000  |

Total: **about 427,000 chunks (426,107 train, 1,000 test) across 80+ sources**

---

## How to Use (Hugging Face)

### Install

```bash
pip install datasets
```

### Load Dataset

```python
from datasets import load_dataset

dataset = load_dataset("nuhmanpk/dev-knowledge-base")

print(dataset["train"][0])
```

---

## Example Use Cases

### 1. Semantic Search

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import numpy as np

dataset = load_dataset("nuhmanpk/dev-knowledge-base")
model = SentenceTransformer("all-MiniLM-L6-v2")

# Slicing a Dataset returns a dict of columns, so read the "content" column directly.
docs = dataset["train"][:1000]["content"]
embeddings = model.encode(docs)

query = "how to build api with fastapi"
q_emb = model.encode([query])

# Dot-product score of every chunk against the query; print the best match.
scores = np.dot(embeddings, q_emb.T).squeeze()
print(docs[int(scores.argmax())])
```

---

### 2. RAG Pipeline

```text
User Query → Embed → Vector DB → Retrieve → LLM → Answer
```

Use with:

* FAISS
* Qdrant
* Pinecone

---

### 3. Fine-tuning

Convert to instruction format:

```json
{
  "instruction": "Explain JWT authentication",
  "input": "",
  "output": "<documentation chunk>"
}
```

---

### 4. Developer Chatbot

Build:

* AI coding assistant
* StackOverflow-style search
* Internal dev knowledge system

---

## Data Processing Pipeline

* Async crawling with rate limiting
* HTML parsing (BeautifulSoup)
* Navigation/content filtering
* Chunking (~800 chars)
* Cleaning & binary removal

Crawler implementation:

---

## Limitations

* Some duplicate content may exist
* Chunk-level context only (not full pages)
* No semantic labeling yet
* Some sources are much larger than others (firebase alone accounts for roughly 300,000 chunks)

---

## Future Improvements

* Deduplication
* Better chunking (semantic splitting)
* Q/A generation
* Code extraction
* Metadata enrichment

---

## License

This dataset is built from publicly available documentation. Refer to the individual sources for licensing.

---

## Author

https://github.com/nuhmanpk

---

## Quick Example

```python
from datasets import load_dataset

ds = load_dataset("nuhmanpk/dev-knowledge-base")

# Print the source and the first 150 characters of the first three chunks.
for row in ds["train"].select(range(3)):
    print(row["source"], "→", row["content"][:150])
```

---

## Summary

A large, structured, and practical dataset for building developer-focused AI systems, from code assistants to full RAG pipelines.

---
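The "Chunking (~800 chars)" step in the data processing pipeline is not spelled out in this card, so the following is only a minimal sketch of how such a step might look. It assumes whitespace-aligned splitting and zero-based chunk indices; the function name `chunk_text` and the `max_chars` parameter are hypothetical, and the actual crawler may split text differently. The output keys mirror the `content`, `chunk_id`, and `chunk_length` columns described above.

```python
from typing import Dict, List


def chunk_text(text: str, max_chars: int = 800) -> List[Dict[str, object]]:
    """Split text into whitespace-aligned chunks of at most max_chars characters.

    Hypothetical helper: an illustration of fixed-size chunking, not the
    dataset author's actual implementation.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        # A single token longer than the limit is hard-split into pieces.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding the word would exceed the limit: close the current chunk.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    # Assumption: chunk_id counts from 0 within a page.
    return [
        {"chunk_id": i, "chunk_length": len(c), "content": c}
        for i, c in enumerate(chunks)
    ]


page = "FastAPI is a modern web framework for building APIs with Python. " * 40
rows = chunk_text(page)
print(len(rows), rows[0]["chunk_length"])
```

Splitting on whitespace keeps words intact at the cost of slightly uneven chunk lengths, which is one plausible reading of the "~800" in the pipeline description.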
Provider:
nuhmanpk