juliensimon/stackexchange-space-qa
收藏Hugging Face2026-04-18 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/juliensimon/stackexchange-space-qa
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-sa-4.0
pretty_name: "Stack Exchange Space Q&A"
language:
- en
description: "This dataset is a clean, tabular Q&A corpus of space and astronomy knowledge, derived from two Stack Exchange community Q&A sites: Astronomy Stack Exchange (astronomy.stackexchange.com) and Space Expl"
task_categories:
- question-answering
- text-generation
tags:
- space
- astronomy
- question-answering
- stack-exchange
- instruction-tuning
- sft
- qa-pairs
- open-data
- tabular-data
- parquet
size_categories:
- 10K<n<100K
configs:
- config_name: default
data_files:
- split: train
path: data/stackexchange_space_qa.parquet
default: true
---
# Stack Exchange Space Q&A
<div align="center">
<img src="banner.jpg" alt="The gamma-ray sky — representative of the community-written answers covering all of space science" width="400">
<p><em>Credit: NASA/DOE/Fermi LAT Collaboration</em></p>
</div>
*Part of a [dataset collection](https://huggingface.co/collections/juliensimon/astronomy-datasets-69c24caf2f17e36128946743) on Hugging Face.*
## Dataset description
This dataset is a clean, tabular Q&A corpus of space and astronomy knowledge, derived from two Stack Exchange community Q&A sites: Astronomy Stack Exchange (astronomy.stackexchange.com) and Space Exploration Stack Exchange (space.stackexchange.com). Each row is one question paired with its best answer — either the question's accepted answer, or if none is accepted, the highest-scored answer. Unanswered questions are included with null answer fields so that downstream consumers can choose to filter them out.
Stack Exchange is vote-ranked, which makes it a particularly strong source of instruction-tuning data: community scores already encode a quality signal, and accepted-answer flags add an explicit gold-standard marker. The two sites together cover the full breadth of space science — from cosmology and stellar astrophysics to orbital mechanics, rocket propulsion, satellite operations, and crewed spaceflight. Topics tend toward the graduate-level-to-working-professional end of the spectrum; questions cite papers, use technical vocabulary, and receive carefully-written answers from practitioners.
The dataset is suitable for instruction fine-tuning (question → answer pairs), preference learning (score-ranked accepted vs. unaccepted pairs can be derived by joining this table with itself on qid), retrieval-augmented generation (as a grounded Q&A corpus for a space-science RAG system), benchmarking (filter by tag to build evaluation sets on a specific topic), and linguistic analysis of how practitioners explain space concepts to each other. HTML has been converted to plain text; inline formatting (code blocks, lists, equations) is preserved where possible.
Content is licensed CC-BY-SA 4.0 (Stack Exchange's standard license). Each row retains the question URL so that attribution can be traced to individual authors on the source site. The dataset is refreshed annually — Stack Exchange publishes quarterly data dumps to archive.org, but the quality-ranked subset changes slowly enough that yearly is sufficient.
## Schema
| Column | Type | Description | Sample | Null % |
|--------|------|-------------|--------|--------|
| `qid` | Int64 | Stack Exchange question ID (unique within its site) | 24590 | 0.0% |
| `site` | string | Which Stack Exchange site the question is from: 'astronomy' (astronomy.stackexchange.com) or 'space' (space.stackexchange.com — Space Exploration) | astronomy | 0.0% |
| `url` | string | Permalink to the question on Stack Exchange | https://astronomy.stackexchange.com/q... | 0.0% |
| `question_title` | string | Question title as posted | How much gold is there in our sun? | 0.0% |
| `question_body` | string | Question body in plain text (stripped of HTML, with code blocks and inline formatting preserved) | XKCD 1944 claims that there is "more ... | 0.0% |
| `question_tags` | string | Semicolon-joined list of tags attached to the question (e.g., 'black-holes;general-relativity') | \|the-sun\| | 0.0% |
| `question_score` | Int64 | Net vote score of the question (upvotes minus downvotes) at dump time | 150 | 0.0% |
| `question_view_count` | Int64 | Number of times the question has been viewed | 26681 | 0.0% |
| `question_answer_count` | Int64 | Total number of answers posted to the question | 2 | 0.0% |
| `question_creation_date` | string | ISO-8601 UTC date when the question was posted | 2018-01-19T08:19:35Z | 0.0% |
| `answer_body` | string | Top-scored answer body in plain text, or null if the question is unanswered. Prefers the accepted answer when one exists; otherwise the highest-scored answer. | The mass of the sun is 1.989 × 1030 k... | 14.6% |
| `answer_score` | Int64 | Net vote score of the selected answer; null if unanswered | 130 | 14.6% |
| `answer_creation_date` | string | ISO-8601 UTC date when the selected answer was posted; null if unanswered | 2018-01-19T08:38:05Z | 14.6% |
| `answer_accepted` | boolean | True if the selected answer is the question's accepted answer; False if it is just the top-scored answer; null if unanswered | True | 14.6% |
## Quick stats
- **33,519** questions across **space** (18,866), **astronomy** (14,653)
- **28,628** have a top answer (**17,052** of those are the question's accepted answer)
- Median question score: **4**
- **52,011,395** total question views across the corpus
## Usage
```python
from datasets import load_dataset
ds = load_dataset("juliensimon/stackexchange-space-qa", split="train")
df = ds.to_pandas()
```
```python
from datasets import load_dataset
ds = load_dataset("juliensimon/stackexchange-space-qa", split="train")
df = ds.to_pandas()
# High-quality accepted answers only — solid instruction-tuning pairs
high_quality = df[(df["answer_accepted"] == True) & (df["question_score"] >= 5)]
print(f"Accepted answers on highly-voted questions: {len(high_quality):,}")
# Filter by tag (e.g. exoplanets on Astronomy SE)
exo = df[df["question_tags"].str.contains("exoplanet", na=False)]
print(f"Exoplanet questions: {len(exo):,}")
# SFT-ready Q→A pairs
sft = (
df[df["answer_body"].notna()]
[["qid", "site", "question_title", "question_body", "answer_body"]]
.rename(columns={"question_body": "prompt_body", "answer_body": "response"})
)
# Plot top 20 tags (astronomy)
import matplotlib.pyplot as plt
astro = df[df["site"] == "astronomy"]
tag_counts = astro["question_tags"].str.split(";").explode().value_counts().head(20)
tag_counts.plot.barh(figsize=(10, 6))
plt.xlabel("Questions"); plt.title("Most-used tags on Astronomy Stack Exchange")
plt.gca().invert_yaxis()
plt.tight_layout(); plt.show()
```
## Data source
https://archive.org/details/stackexchange
## Update schedule
Annually — SE publishes quarterly data dumps, but the accepted-answer quality-filtered subset changes slowly.
## Related datasets
- [juliensimon/nasa-exoplanets](https://huggingface.co/datasets/juliensimon/nasa-exoplanets)
- [juliensimon/astronaut-database](https://huggingface.co/datasets/juliensimon/astronaut-database)
- [juliensimon/space-agency-database](https://huggingface.co/datasets/juliensimon/space-agency-database)
- [juliensimon/hst-observations](https://huggingface.co/datasets/juliensimon/hst-observations)
- [juliensimon/jwst-observations](https://huggingface.co/datasets/juliensimon/jwst-observations)
> If you find this dataset useful, please consider [giving it a like](https://huggingface.co/datasets/juliensimon/stackexchange-space-qa) on Hugging Face. It helps others discover it.
## About the author
Created by [Julien Simon](https://julien.org) — AI Operating Partner at Fortino Capital. Part of the [Space Datasets](https://julien.org/datasets) collection.
## Citation
```bibtex
@dataset{stackexchange_space_qa,
title = {Stack Exchange Space Q&A},
author = {juliensimon},
year = {2026},
url = {https://huggingface.co/datasets/juliensimon/stackexchange-space-qa},
publisher = {Hugging Face}
}
```
## License
[CC-BY-SA-4.0](https://creativecommons.org/licenses/by/4.0/)
提供机构:
juliensimon



