
Omarrran/stackpulse_qa_output

Hugging Face | Updated 2026-04-19 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Omarrran/stackpulse_qa_output
Description:
---
license: apache-2.0
language:
- en
pretty_name: "StackPulse-QA: Instruction-Tuning Q&A Pairs from Stack Overflow"
size_categories:
- 100K<n<1M
task_categories:
- question-answering
- text-generation
- text2text-generation
tags:
- stackoverflow
- instruction-tuning
- qa
- code
- fine-tuning
- alpaca-format
- llm-training
---

# 🧩 StackPulse-QA: Instruction-Tuning Q&A Pairs from Stack Overflow

## Dataset Summary

Instruction-tuning Q&A dataset built from [Omarrran/StackPulse_778K_QnA_Code_dataset](https://huggingface.co/datasets/Omarrran/StackPulse_778K_QnA_Code_dataset) by joining question IDs with **BigQuery `bigquery-public-data.stackoverflow.posts_answers`** on `accepted_answer_id`.

Each sample consists of:

- `input_text_instruct`: a question (title + body) prefixed with an instruction
- `output_text`: the **accepted answer** from Stack Overflow

The format mirrors the instruction-tuning dataset from DeepLearning.AI's *Finetuning Large Language Models* course, ready for fine-tuning PaLM, LLaMA, Mistral, Gemma, Phi, and similar models.
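
The join described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the author's pipeline: the toy rows are invented, the answer-side field names `id` and `body` are assumed to match the BigQuery table, `build_pairs` is a hypothetical helper, and the exact whitespace of the instruction prefix is a guess based on the template given in this card.

```python
import json

# Instruction prefix; wording from this card's template, whitespace assumed.
INSTRUCTION = (
    "Please answer the following Stackoverflow question on Programming. "
    "Answer it like you are a developer answering Stackoverflow questions.\n\n"
    "Stackoverflow question:\n"
)


def build_pairs(questions, answers):
    """Join questions to their accepted answers and format instruction pairs."""
    # Index answers by ID so each question needs a single lookup.
    answers_by_id = {a["id"]: a["body"] for a in answers}
    pairs = []
    for q in questions:
        answer_body = answers_by_id.get(q["accepted_answer_id"])
        if answer_body is None:
            continue  # accepted answer not found in the answers table
        pairs.append({
            "input_text_instruct": INSTRUCTION + q["title"] + q["body"],
            "output_text": answer_body,
        })
    return pairs


# Hypothetical toy rows standing in for the question metadata and answers table.
questions = [
    {"question_id": 1, "title": "How to reverse a list?",
     "body": "<p>I have a list.</p>", "accepted_answer_id": 10},
    {"question_id": 2, "title": "What is a lambda?",
     "body": "<p>Please explain.</p>", "accepted_answer_id": 99},
]
answers = [
    {"id": 10, "body": "<p>Use <code>reversed()</code>.</p>"},
    {"id": 11, "body": "<p>Unrelated answer.</p>"},
]

pairs = build_pairs(questions, answers)
for row in pairs:
    print(json.dumps(row))  # one JSONL line per matched question
```

Only the first toy question yields a pair, because the second one points at an answer ID that is absent from the answers list.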

---

## 📊 Processing Progress

- **Runs completed**: 4 / 6
- **Questions processed**: 400,000 / 554,196
- **Remaining**: 154,196

---

## 📁 Files in This Dataset

### 🏋️ Training Files (80% split)

| File | Format | Description |
|------|--------|-------------|
| data/tune_data_stack_overflow_python_qa_run1-07:19:04:2026.jsonl | JSONL | Training split from run 1 |
| data/tune_data_stack_overflow_python_qa_run2-07:19:04:2026.jsonl | JSONL | Training split from run 2 |
| data/tune_data_stack_overflow_python_qa_run3-07:19:04:2026.jsonl | JSONL | Training split from run 3 |
| data/tune_data_stack_overflow_python_qa_run4-07:19:04:2026.jsonl | JSONL | Training split from run 4 |
| data/tune_data_stack_overflow_python_qa_run5-07:19:04:2026.jsonl | JSONL | Training split from run 5 |

### 🧪 Evaluation Files (20% split)

| File | Format | Description |
|------|--------|-------------|
| data/tune_eval_data_stack_overflow_python_qa_run1-07:19:04:2026.jsonl | JSONL | Eval split from run 1 |
| data/tune_eval_data_stack_overflow_python_qa_run2-07:19:04:2026.jsonl | JSONL | Eval split from run 2 |
| data/tune_eval_data_stack_overflow_python_qa_run3-07:19:04:2026.jsonl | JSONL | Eval split from run 3 |
| data/tune_eval_data_stack_overflow_python_qa_run4-07:19:04:2026.jsonl | JSONL | Eval split from run 4 |

### 📄 Full Metadata CSVs

| File | Format | Description |
|------|--------|-------------|
| data/stackpulse_qa_full_run1-07:19:04:2026.csv | CSV | Full metadata for run 1 |
| data/stackpulse_qa_full_run2-07:19:04:2026.csv | CSV | Full metadata for run 2 |
| data/stackpulse_qa_full_run3-07:19:04:2026.csv | CSV | Full metadata for run 3 |
| data/stackpulse_qa_full_run4-07:19:04:2026.csv | CSV | Full metadata for run 4 |

---

## 🏗️ Schema

### JSONL Files (training / eval)

Exactly 2 fields per row, ready for instruction fine-tuning:

| Field | Type | Description |
|-------|------|-------------|
| `input_text_instruct` | string | Instruction prefix + question title + question body |
| `output_text` | string | Accepted answer body (HTML format) |

### CSV Files (full metadata)

| Column | Description |
|--------|-------------|
| question_id | Stack Overflow question ID |
| input_text | title + body (no instruction prefix) |
| output_text | accepted answer body |
| input_text_instruct | instruction-prefixed input (same as JSONL) |
| title | question title only |
| tags | pipe-separated tags |
| q_score | question upvote score |
| view_count | total views |
| answer_count | number of answers |
| accepted_answer_id | ID of the accepted answer |
| answer_id | ID of this answer (= accepted_answer_id) |
| a_score | answer upvote score |
| is_accepted | always True (we only keep accepted answers) |
| creation_date | question creation timestamp |

---

## 🚀 Quick Start

### Load with pandas

```python
import glob

import pandas as pd

# pd.read_json does not expand wildcards, so resolve the file names first
train_path = glob.glob("data/tune_data_stack_overflow_python_qa_run1-*.jsonl")[0]
eval_path = glob.glob("data/tune_eval_data_stack_overflow_python_qa_run1-*.jsonl")[0]

# Training data
train = pd.read_json(train_path, lines=True)

# Eval data
eval_ = pd.read_json(eval_path, lines=True)

print(train.iloc[0]["input_text_instruct"][:300])
print(train.iloc[0]["output_text"][:300])
```

### Load with HuggingFace `datasets`

```python
from datasets import load_dataset

# Load all training and eval shards
ds = load_dataset(
    "json",
    data_files={
        "train": "data/tune_data_stack_overflow_python_qa_run*.jsonl",
        "eval": "data/tune_eval_data_stack_overflow_python_qa_run*.jsonl",
    }
)
print(ds)
```

### Use for fine-tuning (Alpaca-style)

```python
def format_prompt(ex):
    return {
        "text": f"{ex['input_text_instruct']}\n\n### Response:\n{ex['output_text']}"
    }

train_formatted = ds["train"].map(format_prompt)
```

---

## 📋 Instruction Template Used

Please answer the following Stackoverflow question on Programming. Answer it like you are a developer answering Stackoverflow questions.

Stackoverflow question:
{title}{body}

---

## ⚠️ Caveats

1. **HTML in answers**: `output_text` contains raw HTML tags (`<p>`, `<pre>`, `<code>`). Strip or preserve them depending on your use case.
2. **Accepted answers only**: We filter on `q.accepted_answer_id = a.id`; other community answers are skipped.
3. **~60% match rate**: Of every 100K question IDs queried, ~60K have accepted answers in BigQuery. The rest are self-answered, deleted, or lack acceptance.
4. **80/20 split**: Each run uses `random_state=42` for reproducible train/eval splits.
5. **Mirrors L2_data.ipynb**: The format exactly matches the notebook structure of DeepLearning.AI's *Finetuning Large Language Models* course.

---

## 🔁 Source Dataset

Question IDs and metadata sourced from:

- [Omarrran/StackPulse_778K_QnA_Code_dataset](https://huggingface.co/datasets/Omarrran/StackPulse_778K_QnA_Code_dataset)

Answers joined from:

- `bigquery-public-data.stackoverflow.posts_answers` (Google BigQuery Public Dataset)

---

## 📋 Citation

```bibtex
@dataset{malik2026stackpulseqa,
  author    = {Malik, Omar Haq Nawaz},
  title     = {StackPulse-QA: Instruction-Tuning Q&A Pairs from Stack Overflow},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/datasets/Omarrran/stackpulse_qa_output},
  license   = {Apache-2.0}
}
```

---

## 👤 Author

**Omar Haq Nawaz Malik** (HuggingFace: [Omarrran](https://huggingface.co/Omarrran))

AI Engineer & NLP Researcher | BITS Pilani | Srinagar, Kashmir
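
---

As a follow-up to the caveat about raw HTML in `output_text`, here is one possible way to reduce an answer to plain text using only the Python standard library. It is a rough sketch rather than a recommended preprocessing step: `TextExtractor` and `strip_html` are hypothetical names, and the choice to break lines after `<p>` and `<pre>` blocks is an assumption about what downstream training code wants.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and drops all tags."""

    def __init__(self):
        super().__init__()  # character references are unescaped by default
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        # Keep a line break where a paragraph or code block ended.
        if tag in ("p", "pre"):
            self.parts.append("\n")


def strip_html(html_text):
    """Return the text of an HTML fragment with tags removed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


sample = "<p>Use <code>reversed()</code>.</p><pre><code>x = 1 &lt; 2\n</code></pre>"
print(strip_html(sample))
```

With the `datasets` loader shown in the Quick Start, a function like this could be applied through `ds["train"].map(...)`. Whether to strip HTML at all depends on whether the fine-tuned model should emit Stack Overflow markup.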
Provided by:
Omarrran