
Omarrran/StackPulse_778K_QnA_Code_dataset

Hugging Face · Updated 2026-04-19 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Omarrran/StackPulse_778K_QnA_Code_dataset
Resource description:
---
license: apache-2.0
language:
- en
pretty_name: "StackPulse-778K: Developer Q&A & Code Dataset"
size_categories:
- 100K<n<1M
task_categories:
- text-classification
- question-answering
- text-generation
- feature-extraction
- zero-shot-classification
tags:
- stackoverflow
- code
- nlp
- question-answering
- programming
- python
- javascript
- java
- developer
- community-qa
- html
- tags
configs:
- config_name: full
  data_files:
  - split: train
    path: data/stackoverflow_778k_full.jsonl
- config_name: unanswered
  data_files:
  - split: train
    path: data/stackoverflow_unanswered.jsonl
- config_name: with_code
  data_files:
  - split: train
    path: data/stackoverflow_with_code.jsonl
- config_name: high_quality
  data_files:
  - split: train
    path: data/stackoverflow_high_quality.jsonl
- config_name: python
  data_files:
  - split: train
    path: data/stackoverflow_python.jsonl
- config_name: javascript
  data_files:
  - split: train
    path: data/stackoverflow_javascript.jsonl
- config_name: java
  data_files:
  - split: train
    path: data/stackoverflow_java.jsonl
- config_name: csharp
  data_files:
  - split: train
    path: data/stackoverflow_csharp.jsonl
- config_name: android
  data_files:
  - split: train
    path: data/stackoverflow_android.jsonl
---

# 💻 StackOverflow-778K: Multi-Year Developer Q&A Dataset

## Dataset Summary

A large-scale Stack Overflow question dataset containing **778,929 unique questions** sampled from a roughly 7-year window (**2015–2022**, with no questions from 2019 or 2021). Each question includes the raw HTML body, a plain-text version, tags, score, view count, answer count, and a set of derived features for immediate ML use.

The data was collected in **8 sampling runs** on Feb 27, 2026. Deduplication removed only 2 duplicates, leaving **778,929 unique questions**.

---

## 📁 Files in This Dataset

| File | Format | Rows | Description |
|------|--------|------|-------------|
| stackoverflow_778k_full.csv | CSV | 778,929 | Complete dataset |
| stackoverflow_unanswered.csv | CSV | 224,733 | Questions with 0 answers |
| stackoverflow_with_code.csv | CSV | 595,679 | Questions containing code blocks |
| stackoverflow_high_quality.csv | CSV | 20,205 | Score ≥ 5 and answered |
| stackoverflow_python.csv | CSV | 107,083 | Python-tagged questions |
| stackoverflow_javascript.csv | CSV | 87,367 | JavaScript-tagged questions |
| stackoverflow_java.csv | CSV | 54,077 | Java-tagged questions |
| stackoverflow_csharp.csv | CSV | 40,420 | C#-tagged questions |
| stackoverflow_android.csv | CSV | 42,004 | Android-tagged questions |
| *.jsonl versions | JSONL | same | HuggingFace-native format for all above |

---

## 🏗️ Schema Reference

| Column | Type | Description |
|--------|------|-------------|
| id | int64 | Unique Stack Overflow question ID |
| title | string | Question title |
| question_body | string | Raw HTML body (includes `<pre><code>` blocks) |
| body_text | string | Plain text body (HTML stripped, code replaced with [CODE]) |
| tags | string | Pipe-separated tags e.g. `python\|pandas\|dataframe` |
| score | int64 | Net upvotes (can be negative) |
| creation_date | string | ISO 8601 UTC creation timestamp |
| year | int32 | Year extracted from creation_date |
| month | int32 | Month (1–12) |
| hour | int32 | Hour of day (0–23, UTC) |
| dayofweek | int32 | Day of week (0=Monday, 6=Sunday) |
| view_count | int64 | Total question views |
| answer_count | int64 | Number of answers received |
| body_len | int64 | Plain text body character length |
| title_len | int64 | Title character length |
| tag_count | int64 | Number of tags (1–5) |
| code_block_cnt | int64 | Number of `<pre>` code blocks in body |
| has_code | bool | True if question contains at least one code block |
| is_unanswered | bool | True if answer_count == 0 |
| is_popular | bool | True if view_count > 95th percentile (~2,599) |
| is_viral | bool | True if view_count > 99th percentile (~11,513) |
| is_highly_voted | bool | True if score >= 10 |
| is_negative | bool | True if score < 0 |
| score_bucket | string | "negative" / "zero" / "low" / "medium" / "high" |

---

## 📈 Dataset Statistics

### Overview

- **Total questions**: 778,929
- **Date range**: 2015-02-14 → 2022-09-25
- **Missing years**: 2019, 2021 (sampling gaps)
- **Unique tags**: 41,754
- **Zero nulls** in all core columns (2 questions have empty tags)

### Score Distribution

- Mean: 0.69 | Median: 0 | Std: 4.77
- Range: -27 to 1,061
- Negative score: 63,415 (8.14%)
- Zero score: 461,977 (59.31%); the majority of questions were never upvoted
- Score ≥ 10: 7,431 (0.95%)
- Score ≥ 100: 246 (0.032%)

### View Count Distribution

- Mean: 794 | Median: 66 | P95: 2,599 | P99: 11,513
- Max: 915,870 views
- Popular (>P95): 38,944 (5.00%)
- Viral (>P99): 7,789 (1.00%)

### Answer Count

- Unanswered: 224,733 (28.85%)
- 1 answer: 387,399 (49.73%)
- 2+ answers: 166,797 (21.42%)
- Max answers on a single question: 36

### Question Body

- Has code block: 595,679 (76.47%)
- Avg code blocks per question: 1.56
- Avg body length: 557 chars | Median: 447 chars
- Avg title length: 59 chars

### Tags

- Avg tags per question: 3.01
- 5 tags (SO max): 117,671 (15.1%)
- 1 tag: 92,409 (11.9%)

### Questions Per Year

- -1: 2,694
- 2015: 99,664
- 2016: 120,887
- 2017: 20,791
- 2018: 20,096
- 2020: 99,676
- 2022: 415,121

*(Note: 2017/2018 low counts reflect sampling focus; 2022 dominates at 53%)*

### Top 10 Tags

- python (98,154)
- javascript (87,156)
- java (53,052)
- c# (40,178)
- android (38,543)
- html (36,572)
- php (34,656)
- reactjs (31,727)
- css (24,770)
- r (21,694)

### Unanswered Rate by Top Tags

- node.js: 36.46% unanswered (highest among these tags, i.e. hardest to get answered)
- reactjs: 35.31%
- flutter: 33.44%
- typescript: 31.19%
- android: 30.49%
- python: 28.69%
- java: 26.88%
- css: 19.59%
- jquery: 18.30% (lowest among these tags, i.e. easiest to get answered)

---

## ⚠️ Known Issues & Caveats

1. **YEAR GAPS**: Years 2019 and 2021 are absent. This is a sampling artifact, not a gap in SO activity. Do not use the dataset for temporal trend analysis without accounting for this.
2. **2022 DOMINANCE**: 415,121 questions (53%) are from 2022. The dataset skews heavily toward recent questions. Stratify by year if balance matters.
3. **RAW HTML**: `question_body` contains raw HTML, including entities such as `&lt;` and `&gt;` and `<pre><code>` blocks. Use `body_text` for NLP tasks. Use `question_body` for HTML-aware or code-extraction tasks.
4. **SCORE SKEW**: 59.3% of questions have score=0, so the mean (0.69) is misleading. Use `score_bucket` or `is_highly_voted` for classification tasks.
5. **VIEW COUNT SKEW**: The mean (794) is 12× the median (66) due to viral questions. Use log-transformed view_count for regression tasks.
6. **PIPE-SEPARATED TAGS**: The `tags` column uses `|` as its delimiter, e.g. `python|pandas|dataframe`. Split with `str.split("|")` before use.
7. **CODE PLACEHOLDER**: In `body_text`, all `<pre>...</pre>` blocks are replaced with the token `[CODE]`. The original HTML is preserved in `question_body`.
8. **DUPLICATE IDs**: 2 exact duplicates were found and removed during processing.
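
Caveats 3 and 7 imply that code has to be recovered from `question_body` rather than `body_text`. The following is a minimal sketch of one way to do that with Python's standard-library `html.parser`; the class `PreBlockExtractor`, the function `extract_code_blocks`, and the sample string are hypothetical names and data introduced here for illustration, not part of the dataset or its tooling. It assumes code lives in `<pre>` blocks, as the schema describes.

```python
from html.parser import HTMLParser


class PreBlockExtractor(HTMLParser):
    """Collect the text inside each <pre>...</pre> block of an HTML string."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; and &gt;.
        super().__init__(convert_charrefs=True)
        self.blocks = []   # finished code blocks, in document order
        self._depth = 0    # > 0 while the parser is inside a <pre> element
        self._parts = []   # text chunks of the block currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        if self._depth > 0:
            self._parts.append(data)


def extract_code_blocks(question_body):
    """Return the decoded text of every <pre> block in a raw HTML body."""
    parser = PreBlockExtractor()
    parser.feed(question_body)
    parser.close()
    return parser.blocks


# Made-up sample in the shape the schema describes (not a real dataset row).
sample = (
    "<p>Why does this fail?</p>"
    "<pre><code>if a &lt; b:\n    print(a)\n</code></pre>"
    "<p>Thanks</p>"
)
blocks = extract_code_blocks(sample)
print(len(blocks))
print(blocks[0])
```

If the assumption about `<pre>` blocks holds, `len(extract_code_blocks(row["question_body"]))` should usually agree with `code_block_cnt`, which gives a cheap sanity check before using the extracted code downstream.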

---

## 🚀 Quick Start

### pandas

```python
import numpy as np
import pandas as pd

# Full dataset
df = pd.read_csv("data/stackoverflow_778k_full.csv")

# High quality only (score >= 5, answered)
hq = pd.read_csv("data/stackoverflow_high_quality.csv")

# Python questions only
py = pd.read_csv("data/stackoverflow_python.csv")

# Unanswered questions (good for difficulty modeling)
ua = pd.read_csv("data/stackoverflow_unanswered.csv")

# Split tags into list (2 questions have empty tags, which read_csv loads as NaN)
df["tag_list"] = df["tags"].fillna("").str.split("|")

# Filter by year
df_2022 = df[df["year"] == 2022]

# Log-transform view count for regression
df["log_views"] = np.log1p(df["view_count"])
```

### HuggingFace datasets

```python
from datasets import load_dataset

REPO = "Omarrran/StackPulse_778K_QnA_Code_dataset"

# Full 778K
ds = load_dataset(REPO, "full")

# High quality only
hq = load_dataset(REPO, "high_quality")

# Python questions
py = load_dataset(REPO, "python")

# Unanswered questions
ua = load_dataset(REPO, "unanswered")

# Convert to pandas
df = ds["train"].to_pandas()
```

---

## 🔬 Suggested Research Tasks

| Task | Config | Key Columns |
|------|--------|-------------|
| Answer prediction (binary) | full | title, body_text, tags → is_unanswered |
| Score regression | full | title, body_text, tags → score |
| View count prediction | full | title, tags, score → log(view_count) |
| Tag recommendation | full | title, body_text → tags |
| Code vs no-code classification | full | body_text → has_code |
| Question quality scoring | full | title, body_text → score_bucket |
| LLM fine-tuning (Q&A) | high_quality | title + body_text as prompt |
| Difficulty estimation | full | tags → unanswered rate per tag |
| Time-of-day analysis | full | hour, dayofweek → view_count / score |
| Language-specific modeling | python/javascript/java | any |

---

## 📋 Citation

```bibtex
@dataset{malik2026stackoverflow,
  author    = {Malik, Omar Haq Nawaz},
  title     = {StackOverflow-778K: Multi-Year Developer Q\&A Dataset},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/datasets/Omarrran/StackPulse_778K_QnA_Code_dataset},
  questions = {778929},
  years     = {2015-2022},
  license   = {Apache-2.0}
}
```

---

## 👤 Author

**Omar Haq Nawaz Malik** (HuggingFace: [Omarrran](https://huggingface.co/Omarrran))

AI Engineer & NLP Researcher | BITS Pilani | Srinagar, Kashmir
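
For the "Difficulty estimation" row under Suggested Research Tasks (tags → unanswered rate per tag), a rough sketch of the aggregation might look like the following. The function name `unanswered_rate_by_tag`, its `min_questions` parameter, and the toy rows are assumptions made for illustration; the only things taken from the dataset card are the `tags` and `answer_count` column names and the pipe delimiter. It also assumes `tags` is always a string, possibly empty.

```python
from collections import Counter


def unanswered_rate_by_tag(rows, min_questions=1):
    """Return {tag: share of questions carrying that tag that have no answers}."""
    total = Counter()
    unanswered = Counter()
    for row in rows:
        # Tags are pipe-separated; filter(None, ...) drops the empty string
        # produced by a question whose tags field is empty.
        for tag in filter(None, row["tags"].split("|")):
            total[tag] += 1
            if row["answer_count"] == 0:
                unanswered[tag] += 1
    return {
        tag: unanswered[tag] / n
        for tag, n in total.items()
        if n >= min_questions
    }


# Toy rows, made up for illustration (not taken from the dataset).
rows = [
    {"tags": "python|pandas", "answer_count": 0},
    {"tags": "python", "answer_count": 2},
    {"tags": "css", "answer_count": 1},
    {"tags": "", "answer_count": 0},
]
rates = unanswered_rate_by_tag(rows)
for tag, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{tag}: {rate:.2%}")
```

The same function should accept the rows of `ds["train"]` from the HuggingFace example, since each row is a dict keyed by column name. Raising `min_questions` keeps rare tags, whose rates are noisy, out of the result.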
Provider:
Omarrran