orgrctera/uda_nq_qa
Hugging Face · Updated 2026-03-21 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/orgrctera/uda_nq_qa
Resource description:
---
license: apache-2.0
language:
- en
pretty_name: UDA Natural Questions (Retrieval)
size_categories:
- 1K<n<10K
tags:
- question-answering
- retrieval
- rag
- open-domain
- wikipedia
- natural-questions
- unstructured-documents
configs:
- config_name: default
  data_files:
  - split: default
    path: data/default-*
dataset_info:
  features:
  - name: input
    dtype: string
  - name: metadata
    dtype: string
  - name: answers
    dtype: string
  - name: doc_url
    dtype: string
  splits:
  - name: default
    num_bytes: 4255499
    num_examples: 2477
  download_size: 1087542
  dataset_size: 4255499
---
# UDA Natural Questions — NQ QA (`orgrctera/uda_nq_qa`)
## Overview
This dataset is the **Natural Questions (NQ)** slice of the **UDA (Unstructured Document Analysis)** benchmark, packaged for **retrieval-oriented** evaluation: **2,477** question–answer instances grounded in **Wikipedia** articles, with gold **long** and **short** answers plus document identifiers.

**Natural Questions** (Kwiatkowski et al., TACL 2019) is a large-scale benchmark built from **real anonymized Google Search queries**. Annotators were shown a question and a **full Wikipedia page** (from top search results) and asked to mark **long answers** (typically paragraph-level evidence) and **short answers** (concise spans or entities) when supported by the page. The task reflects **end-to-end open-domain QA**: systems must cope with long, noisy web-style documents rather than a single pre-selected paragraph.

**UDA** (Hui et al., NeurIPS 2024, Datasets & Benchmarks) is a suite for **Retrieval-Augmented Generation (RAG)** and LLM-based analysis over **real-world** documents in **original** messy formats (e.g. HTML/PDF), stressing **parsing, chunking, and retrieval** alongside generation. This Hub release aligns with UDA’s **NQ QA** configuration (`sub_benchmark`: `nq_qa`): each row is a **retrieval task instance**, so pipelines must **retrieve** the right Wikipedia content (or equivalent passages) and **ground** predictions in that evidence when scoring answer quality.
## Task
- **Task type:** **Retrieval** (within a RAG / document-analysis pipeline) for **Natural Questions**–style **open-domain QA** over **Wikipedia**-linked documents.
- **Input:** A natural-language question (`input`) as posed by real users.
- **Supervision / reference:** `expected_output` is a JSON string with gold **long** and **short** answers and a **`doc_url`** pointing at the Wikipedia revision used for annotation. `metadata` records UDA identifiers (`sub_benchmark`: `nq_qa`).

Evaluation typically combines **retrieval quality** (whether the model surfaces the correct article or passages) with **answer correctness** (e.g. **long-answer** F1 and **short-answer** F1 as in the original NQ leaderboard), consistent with **UDA** protocols that treat retrieval and parsing as first-class components.
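
The two evaluation components described above can be sketched roughly as follows. This is a minimal illustration, not the official NQ or UDA scorer: `retrieval_hit` and `token_f1` are hypothetical helper names, the cutoff `k` is an assumed parameter, and the token-overlap F1 shown here is a common SQuAD-style approximation rather than the span-level metric used by the NQ leaderboard.

```python
from collections import Counter


def retrieval_hit(retrieved_urls, gold_url, k=5):
    """Return True if the gold document URL is among the top-k retrieved URLs."""
    return gold_url in retrieved_urls[:k]


def token_f1(prediction, reference):
    """Token-overlap F1 between two strings (lowercased, whitespace-tokenized)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Scoring empty-versus-empty as a match is a design choice that suits rows whose gold short answer is empty; a stricter protocol might skip such rows instead.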
## Background
### Natural Questions (NQ)
NQ targets **question answering** where the evidence is a **full Wikipedia article** rather than a single paragraph. Questions are **naturally occurring**; answers may require **reading**, **coreference**, and **span selection** at multiple granularities. The original work reports strong human agreement and uses **long** vs. **short** answer prediction as core tasks, making NQ a standard benchmark for **open-domain** and **retrieval-augmented** systems.
### UDA benchmark
UDA aggregates **2,965** real-world documents and **29,590** expert-annotated Q&A pairs across domains and query types, evaluating design choices for **LLM- and RAG-based** document analysis. The **NQ QA** track in this release contains **2,477** examples in the `default` split, packaged with UDA metadata for integration with the broader benchmark.
## Data fields
| Column | Type | Description |
|--------|------|-------------|
| `input` | `string` | Natural-language question (real user–style query). |
| `expected_output` | `string` | JSON with `answers` (`long_answer`, `short_answer`) and `doc_url` (Wikipedia page URL for the annotated revision). |
| `metadata` | struct | `benchmark_name` (`uda_nq_qa`), `benchmark_type` (`uda`), `split`, `sub_benchmark` (`nq_qa`), and `value` (JSON string with `label_key`, `label_file`, `doc_url`, `q_uid`). |

**Splits:** Single split `default` with **2,477** examples.
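
A row in this layout could be decoded along the following lines. This is a sketch under assumptions: `parse_row` is a hypothetical helper, and it follows the table's description of a single `expected_output` JSON column. The front matter instead lists `answers` and `doc_url` as separate string features, so the field access may need adjusting to match the rows actually loaded.

```python
import json


def parse_row(row):
    """Decode one row into question, long answer, short answer, and document URL."""
    expected = row["expected_output"]
    # The card describes expected_output as a JSON string; tolerate a decoded dict too.
    if isinstance(expected, str):
        expected = json.loads(expected)
    answers = expected.get("answers", {})
    short = answers.get("short_answer", "")
    return {
        "question": row["input"],
        "long_answer": answers.get("long_answer", ""),
        # Empty short answers become None so callers can skip them explicitly.
        "short_answer": short or None,
        "doc_url": expected.get("doc_url"),
    }
```

With the `datasets` library installed, rows could be obtained with `load_dataset("orgrctera/uda_nq_qa", split="default")` and passed to `parse_row` one at a time.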
## Examples
The following rows are taken from the dataset (JSON in `expected_output` is formatted for readability).

**Example 1: long and short answers (Supreme Court)**
- **`input`:** `who determines the size of the supreme court`
- **`expected_output`:**
```json
{
  "answers": {
    "long_answer": "Article III of the United States Constitution does not specify the number of justices. The Judiciary Act of 1789 called for the appointment of six `` judges ''. Although an 1801 act would have reduced the size of the court to five members upon its next vacancy, an 1802 act promptly negated the 1801 act, legally restoring the court 's size to six members before any such vacancy occurred. As the nation 's boundaries grew, Congress added justices to correspond with the growing number of judicial circuits: seven in 1807, nine in 1837, and ten in 1863.",
    "short_answer": "Congress added"
  },
  "doc_url": "https://en.wikipedia.org//w/index.php?title=Supreme_Court_of_the_United_States&oldid=815925358"
}
```
**Example 2 — long answer; empty short answer**
- **`input`:** `actress who plays victoria newman on young and the restless`
- **`expected_output`:**
```json
{
  "answers": {
    "long_answer": "On March 21, 2005, Heinle joined the cast of the CBS soap opera The Young and the Restless, as Victoria Newman, replacing the popular Heather Tom in the role. She won a Daytime Emmy Award for Outstanding Supporting Actress in a Drama Series in 2014 and again in 2015 for the role.",
    "short_answer": ""
  },
  "doc_url": "https://en.wikipedia.org//w/index.php?title=Amelia_Heinle&oldid=850793939"
}
```
**`metadata.value` (structure, example):**
```json
{
  "label_key": "Amelia Heinle",
  "label_file": "nq_qa",
  "doc_url": "https://en.wikipedia.org//w/index.php?title=Amelia_Heinle&oldid=850793939",
  "q_uid": "6002803770874668212"
}
```
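
The `doc_url` values in the examples above carry both the page title and the revision id as query parameters. A minimal sketch for extracting them with the standard library, and for cross-checking against `metadata.value`, might look like this. The helper names `parse_wikipedia_revision` and `label_matches_title` are hypothetical, and the underscore-to-space comparison with `label_key` is an assumption based only on the single example shown.

```python
import json
from urllib.parse import urlparse, parse_qs


def parse_wikipedia_revision(doc_url):
    """Extract (title, revision id) from an index.php-style Wikipedia URL."""
    query = parse_qs(urlparse(doc_url).query)
    title = query.get("title", [None])[0]
    oldid = query.get("oldid", [None])[0]
    revision = int(oldid) if oldid is not None else None
    return title, revision


def label_matches_title(metadata_value):
    """Check whether label_key equals the URL title with underscores read as spaces."""
    # metadata.value is described as a JSON string; accept a decoded dict as well.
    value = json.loads(metadata_value) if isinstance(metadata_value, str) else metadata_value
    title, _ = parse_wikipedia_revision(value["doc_url"])
    return title is not None and title.replace("_", " ") == value["label_key"]
```

The doubled slash before `w/index.php` in these URLs only affects the path component, so the query-string parsing is unaffected.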
## References
### Natural Questions (foundational dataset)
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, Slav Petrov. **Natural Questions: A Benchmark for Question Answering Research.** *Transactions of the Association for Computational Linguistics (TACL)*, 2019.
- **Paper:** [ACL Anthology Q19-1026](https://aclanthology.org/Q19-1026/)
- **Project / dataset:** [Google Research — Natural Questions](https://ai.google.com/research/NaturalQuestions/)
- **Repository:** [google-research-datasets/natural-questions](https://github.com/google-research-datasets/natural-questions)

**Abstract (summary):** Introduces a corpus of real search queries paired with Wikipedia pages; annotators mark long answers (paragraph-level) and short answers when present, enabling realistic **open-domain QA** evaluation with human upper bounds on long- and short-answer selection.
### UDA (benchmark suite containing this NQ slice)
Yulong Hui, Yao Lu, Huanchen Zhang. **UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis.** *NeurIPS 2024* (Datasets and Benchmarks Track). Also: [arXiv:2406.15187](https://arxiv.org/abs/2406.15187).
- **Abstract (short):** Presents UDA with thousands of real-world documents and tens of thousands of expert-annotated Q&A pairs; evaluates LLM- and RAG-based document analysis and highlights **parsing** and **retrieval** design choices across domains.
- **Code & resources:** [UDA-Benchmark](https://github.com/qinchuanhui/UDA-Benchmark)
- Aggregated UDA QA reference: [qinchuanhui/UDA-QA](https://huggingface.co/datasets/qinchuanhui/UDA-QA)
## Citation
If you use this dataset, please cite **Natural Questions**, **UDA**, and this dataset record as appropriate:
```bibtex
@article{kwiatkowski-etal-2019-natural,
title = {Natural Questions: A Benchmark for Question Answering Research},
author = {Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav},
journal = {Transactions of the Association for Computational Linguistics},
volume = {7},
year = {2019},
pages = {452--466},
doi = {10.1162/tacl_a_00276},
url = {https://aclanthology.org/Q19-1026/}
}
```
```bibtex
@inproceedings{hui2024uda,
title = {UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis},
author = {Yulong Hui and Yao Lu and Huanchen Zhang},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track},
year = {2024}
}
```
## License
The original **Natural Questions** data is released by Google under terms described in the [official repository](https://github.com/google-research-datasets/natural-questions) and [project page](https://ai.google.com/research/NaturalQuestions/). Use this dataset in compliance with those terms and with the **UDA** benchmark conditions. This Hub card is provided for documentation; verify licenses for your use case before redistribution or commercial use.
Provider: orgrctera



