quantaRoche/nasa-smd-qa-benchmark-cleaned
Source: Hugging Face. Updated 2026-03-27; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/quantaRoche/nasa-smd-qa-benchmark-cleaned
Resource description:
---
license: cc
task_categories:
- question-answering
language:
- en
tags:
- extractive-qa
- squad_v2
- cleaned
- span-aligned
- earth-science
pretty_name: NASA-QA (Cleaned SQuAD v2 Format)
---
# NASA-QA (Cleaned, SQuAD v2 Format)
This dataset is a cleaned and reformatted derivative of:
https://huggingface.co/datasets/nasa-impact/nasa-smd-qa-benchmark
The original dataset is an extractive question answering benchmark in the Earth science domain.
This version modifies the data to ensure compatibility with SQuAD v2-style training and evaluation.
## Changes from Original
Compared to the original release, this version:
- converts the nested structure into one example per QA pair
- reformats the dataset into SQuAD v2-style schema
- removes `plausible_answers` from unanswerable examples
- adds explicit `answer_end` indices
- normalizes answer text (e.g., casing, punctuation, whitespace)
- realigns answer spans to match exact substrings in the context
- fixes incorrect or inconsistent answer offsets
All answerable examples are aligned such that:
> the answer text appears exactly in the context at the specified indices
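
The alignment guarantee above can be checked mechanically. The following is a minimal sketch, not the authors' validation code: `span_is_aligned` is a hypothetical helper, and it assumes `answer_end` is an exclusive index (Python slice style), which the card does not state. If the dataset uses an inclusive end, the expected value would be `start + len(text) - 1` instead.

```python
def span_is_aligned(context, text, start, end=None):
    """Return True if `text` occurs in `context` exactly at `start`.

    Assumption: `end`, when given, is exclusive (context[start:end] == text).
    """
    # The start offset plus the answer length must reproduce the answer text.
    if context[start:start + len(text)] != text:
        return False
    # Optionally confirm the stored end index under the exclusive convention.
    return end is None or end == start + len(text)


# Made-up context for illustration; not taken from the dataset.
context = "Aerosols scatter sunlight."
print(span_is_aligned(context, "scatter", 9, 16))
print(span_is_aligned(context, "scatter", 8))
```

Running a check like this over every answerable example is a quick way to confirm which end-index convention the data actually follows.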
## Dataset Structure
Each example contains:
- `id`: unique identifier
- `question`: question text
- `context`: paragraph context
- `answers`:
  - `text`: list of answer strings
  - `answer_start`: start indices
  - `answer_end`: end indices
- `is_impossible`: whether the question is unanswerable
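
To make the field layout concrete, here is a sketch of what one answerable and one unanswerable record might look like as Python dictionaries. The records are invented for illustration and are not drawn from the dataset. Two things are assumptions: that `is_impossible` sits at the top level of each example, and that `answer_end` is exclusive. `is_consistent` is a hypothetical helper, not part of any library.

```python
# Hypothetical records for illustration only; not taken from the dataset.
answerable = {
    "id": "example-0001",
    "question": "What do aerosols scatter?",
    "context": "Aerosols scatter sunlight.",
    "answers": {
        "text": ["sunlight"],
        "answer_start": [17],
        "answer_end": [25],  # assumed exclusive end index
    },
    "is_impossible": False,
}

unanswerable = {
    "id": "example-0002",
    "question": "Which satellite measured the aerosols?",
    "context": "Aerosols scatter sunlight.",
    "answers": {"text": [], "answer_start": [], "answer_end": []},
    "is_impossible": True,
}


def is_consistent(record):
    """Check that the answer lists line up and agree with `is_impossible`."""
    answers = record["answers"]
    lengths = {len(answers[key]) for key in ("text", "answer_start", "answer_end")}
    if len(lengths) != 1:
        return False  # the three lists must be parallel
    has_answer = len(answers["text"]) > 0
    return record["is_impossible"] == (not has_answer)


print(is_consistent(answerable), is_consistent(unanswerable))
```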
Splits:
- `train`
- `validation`
## Notes
- This dataset is intended for extractive QA tasks using SQuAD v2-style evaluation.
- Unanswerable questions are preserved using empty answer lists.
- `plausible_answers` from the original dataset were removed, as they are not required for standard SQuAD v2 training or evaluation.
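
For SQuAD v2-style scoring, one common route is the `squad_v2` metric from the Hugging Face `evaluate` library, which reads `id`, `answers.text`, and `answers.answer_start` from each reference and `id`, `prediction_text`, and `no_answer_probability` from each prediction. The sketch below only builds those dictionaries, so it runs without any dependency. Both helper names are hypothetical, and the extra `answer_end` field is dropped on the assumption that the metric does not need it.

```python
def to_squad_v2_reference(record):
    """Keep only the fields a SQuAD v2-style metric reads from a reference."""
    answers = record["answers"]
    return {
        "id": record["id"],
        "answers": {
            "text": list(answers["text"]),
            "answer_start": list(answers["answer_start"]),
        },
    }


def to_squad_v2_prediction(record_id, predicted_text, no_answer_probability=0.0):
    """Build a prediction entry; an empty string means 'no answer'."""
    return {
        "id": record_id,
        "prediction_text": predicted_text,
        "no_answer_probability": no_answer_probability,
    }


# Invented unanswerable record: the empty lists carry through unchanged.
record = {
    "id": "example-0002",
    "answers": {"text": [], "answer_start": [], "answer_end": []},
    "is_impossible": True,
}
reference = to_squad_v2_reference(record)
prediction = to_squad_v2_prediction(record["id"], "", no_answer_probability=1.0)
print(reference)
print(prediction)
```

Lists of such references and predictions could then be passed to the metric's `compute` call; that step is omitted here because it requires installing `evaluate`.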
## Attribution
This dataset is derived from the original NASA-QA benchmark:
- NASA SMD & IBM Research
- Paper: https://arxiv.org/abs/2405.10725
Please cite the original work when using this dataset.
## License
This dataset inherits the license from the original dataset.
Provider:
quantaRoche



