juliensimon/stackexchange-space-qa

Name: juliensimon/stackexchange-space-qa
Creator: juliensimon
Published: 2026-04-18 11:37:08
License: CC-BY-SA-4.0

Hugging Face · Updated 2026-04-18 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/juliensimon/stackexchange-space-qa

Description:

---
license: cc-by-sa-4.0
pretty_name: "Stack Exchange Space Q&A"
language:
- en
description: "This dataset is a clean, tabular Q&A corpus of space and astronomy knowledge, derived from two Stack Exchange community Q&A sites: Astronomy Stack Exchange (astronomy.stackexchange.com) and Space Expl"
task_categories:
- question-answering
- text-generation
tags:
- space
- astronomy
- question-answering
- stack-exchange
- instruction-tuning
- sft
- qa-pairs
- open-data
- tabular-data
- parquet
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/stackexchange_space_qa.parquet
  default: true
---

# Stack Exchange Space Q&A

<div align="center">
<img src="banner.jpg" alt="The gamma-ray sky, representative of the community-written answers covering all of space science" width="400">
<p><em>Credit: NASA/DOE/Fermi LAT Collaboration</em></p>
</div>

*Part of a [dataset collection](https://huggingface.co/collections/juliensimon/astronomy-datasets-69c24caf2f17e36128946743) on Hugging Face.*

## Dataset description

This dataset is a clean, tabular Q&A corpus of space and astronomy knowledge, derived from two Stack Exchange community Q&A sites: Astronomy Stack Exchange (astronomy.stackexchange.com) and Space Exploration Stack Exchange (space.stackexchange.com). Each row is one question paired with its best answer: the question's accepted answer or, if none is accepted, the highest-scored answer. Unanswered questions are included with null answer fields so that downstream consumers can choose to filter them out.

Stack Exchange is vote-ranked, which makes it a particularly strong source of instruction-tuning data: community scores already encode a quality signal, and accepted-answer flags add an explicit gold-standard marker. The two sites together cover the full breadth of space science, from cosmology and stellar astrophysics to orbital mechanics, rocket propulsion, satellite operations, and crewed spaceflight.
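The selection rule described above (accepted answer first, otherwise the highest-scored answer) can be illustrated with a minimal sketch. This is not the author's build pipeline: the input format and the key names `answer_id`, `score`, and `is_accepted` are hypothetical, chosen only for illustration, and unanswered questions are not handled here.

```python
from collections import defaultdict


def pick_best_answers(answers):
    """Map each question ID to its selected answer.

    `answers` is an iterable of dicts with the hypothetical keys
    qid, answer_id, score and is_accepted (assumed names, not the
    dataset's actual raw format).
    """
    by_question = defaultdict(list)
    for answer in answers:
        by_question[answer["qid"]].append(answer)

    # Tuples compare element by element and True > False, so an accepted
    # answer always outranks an unaccepted one; score decides otherwise.
    return {
        qid: max(group, key=lambda a: (a["is_accepted"], a["score"]))
        for qid, group in by_question.items()
    }


# Toy data: question 1 has an accepted answer that is not the top-scored one;
# question 2 has no accepted answer at all.
toy_answers = [
    {"qid": 1, "answer_id": 10, "score": 5, "is_accepted": True},
    {"qid": 1, "answer_id": 11, "score": 9, "is_accepted": False},
    {"qid": 2, "answer_id": 20, "score": 3, "is_accepted": False},
    {"qid": 2, "answer_id": 21, "score": 7, "is_accepted": False},
]
best = pick_best_answers(toy_answers)
print({qid: a["answer_id"] for qid, a in best.items()})  # {1: 10, 2: 21}
```

In the published table, the `answer_accepted` column records which branch of the rule produced each row.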
Topics tend toward the graduate-level-to-working-professional end of the spectrum; questions cite papers, use technical vocabulary, and receive carefully written answers from practitioners. The dataset is suitable for:

- instruction fine-tuning (question → answer pairs)
- preference learning (score-ranked accepted vs. unaccepted pairs can be derived by joining this table with itself on qid)
- retrieval-augmented generation (as a grounded Q&A corpus for a space-science RAG system)
- benchmarking (filter by tag to build evaluation sets on a specific topic)
- linguistic analysis of how practitioners explain space concepts to each other

HTML has been converted to plain text; inline formatting (code blocks, lists, equations) is preserved where possible. Content is licensed CC-BY-SA 4.0 (Stack Exchange's standard license). Each row retains the question URL so that attribution can be traced to individual authors on the source site. The dataset is refreshed annually: Stack Exchange publishes quarterly data dumps to archive.org, but the quality-ranked subset changes slowly enough that a yearly refresh is sufficient.

## Schema

| Column | Type | Description | Sample | Null % |
|--------|------|-------------|--------|--------|
| `qid` | Int64 | Stack Exchange question ID (unique within its site) | 24590 | 0.0% |
| `site` | string | Which Stack Exchange site the question is from: 'astronomy' (astronomy.stackexchange.com) or 'space' (space.stackexchange.com, Space Exploration) | astronomy | 0.0% |
| `url` | string | Permalink to the question on Stack Exchange | https://astronomy.stackexchange.com/q... | 0.0% |
| `question_title` | string | Question title as posted | How much gold is there in our sun? | 0.0% |
| `question_body` | string | Question body in plain text (stripped of HTML, with code blocks and inline formatting preserved) | XKCD 1944 claims that there is "more ... | 0.0% |
| `question_tags` | string | Semicolon-joined list of tags attached to the question (e.g., 'black-holes;general-relativity') | \|the-sun\| | 0.0% |
| `question_score` | Int64 | Net vote score of the question (upvotes minus downvotes) at dump time | 150 | 0.0% |
| `question_view_count` | Int64 | Number of times the question has been viewed | 26681 | 0.0% |
| `question_answer_count` | Int64 | Total number of answers posted to the question | 2 | 0.0% |
| `question_creation_date` | string | ISO-8601 UTC timestamp of when the question was posted | 2018-01-19T08:19:35Z | 0.0% |
| `answer_body` | string | Body of the selected answer in plain text, or null if the question is unanswered. Prefers the accepted answer when one exists; otherwise the highest-scored answer. | The mass of the sun is 1.989 × 1030 k... | 14.6% |
| `answer_score` | Int64 | Net vote score of the selected answer; null if unanswered | 130 | 14.6% |
| `answer_creation_date` | string | ISO-8601 UTC timestamp of when the selected answer was posted; null if unanswered | 2018-01-19T08:38:05Z | 14.6% |
| `answer_accepted` | boolean | True if the selected answer is the question's accepted answer; False if it is just the top-scored answer; null if unanswered | True | 14.6% |

## Quick stats

- **33,519** questions across **space** (18,866) and **astronomy** (14,653)
- **28,628** have a top answer (**17,052** of those are the question's accepted answer)
- Median question score: **4**
- **52,011,395** total question views across the corpus

## Usage

```python
from datasets import load_dataset

ds = load_dataset("juliensimon/stackexchange-space-qa", split="train")
df = ds.to_pandas()
```

```python
import matplotlib.pyplot as plt
from datasets import load_dataset

ds = load_dataset("juliensimon/stackexchange-space-qa", split="train")
df = ds.to_pandas()

# High-quality accepted answers only: solid instruction-tuning pairs
high_quality = df[(df["answer_accepted"] == True) & (df["question_score"] >= 5)]
print(f"Accepted answers on highly-voted questions: {len(high_quality):,}")

# Filter by tag (e.g. exoplanets on Astronomy SE)
exo = df[df["question_tags"].str.contains("exoplanet", na=False)]
print(f"Exoplanet questions: {len(exo):,}")

# SFT-ready Q→A pairs (rows without an answer are dropped)
sft = (
    df[df["answer_body"].notna()]
    [["qid", "site", "question_title", "question_body", "answer_body"]]
    .rename(columns={"question_body": "prompt_body", "answer_body": "response"})
)

# Plot top 20 tags (astronomy)
astro = df[df["site"] == "astronomy"]
tag_counts = astro["question_tags"].str.split(";").explode().value_counts().head(20)
tag_counts.plot.barh(figsize=(10, 6))
plt.xlabel("Questions")
plt.title("Most-used tags on Astronomy Stack Exchange")
plt.gca().invert_yaxis()  # largest count at the top
plt.tight_layout()
plt.show()
```

## Data source

https://archive.org/details/stackexchange

## Update schedule

Annually. SE publishes quarterly data dumps, but the accepted-answer quality-filtered subset changes slowly.

## Related datasets

- [juliensimon/nasa-exoplanets](https://huggingface.co/datasets/juliensimon/nasa-exoplanets)
- [juliensimon/astronaut-database](https://huggingface.co/datasets/juliensimon/astronaut-database)
- [juliensimon/space-agency-database](https://huggingface.co/datasets/juliensimon/space-agency-database)
- [juliensimon/hst-observations](https://huggingface.co/datasets/juliensimon/hst-observations)
- [juliensimon/jwst-observations](https://huggingface.co/datasets/juliensimon/jwst-observations)

> If you find this dataset useful, please consider [giving it a like](https://huggingface.co/datasets/juliensimon/stackexchange-space-qa) on Hugging Face. It helps others discover it.

## About the author

Created by [Julien Simon](https://julien.org), AI Operating Partner at Fortino Capital. Part of the [Space Datasets](https://julien.org/datasets) collection.
## Citation

```bibtex
@dataset{stackexchange_space_qa,
  title = {Stack Exchange Space Q&A},
  author = {juliensimon},
  year = {2026},
  url = {https://huggingface.co/datasets/juliensimon/stackexchange-space-qa},
  publisher = {Hugging Face}
}
```

## License

[CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/)
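
## Parsing tags

The schema describes `question_tags` as semicolon-joined, while the sample value shown in the schema table is `|the-sun|`. A minimal sketch of a tolerant parser follows, assuming either delimiter may appear in the data; the helper name `parse_tags` is hypothetical and not part of the dataset or of any library.

```python
import re


def parse_tags(raw):
    """Split a raw tag string into a list of tag names.

    Accepts both 'a;b' and '|a|b|' styles (an assumption about the data,
    not a documented guarantee) and returns [] for missing values.
    """
    if not isinstance(raw, str):
        return []
    # Split on either delimiter, then drop empty pieces and stray whitespace.
    return [tag.strip() for tag in re.split(r"[;|]", raw) if tag.strip()]


print(parse_tags("black-holes;general-relativity"))  # ['black-holes', 'general-relativity']
print(parse_tags("|the-sun|"))  # ['the-sun']
```

Applied to the DataFrame from the Usage section, this could look like `df["question_tags"].map(parse_tags)`, which yields a list per row that can then be passed to `.explode()` for tag counts.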
