AmanPriyanshu/RLVR-Env-Retrieval-Source-Retrieval-Synthetic-NVDocs-v1
Hugging Face | Updated 2026-03-14 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/AmanPriyanshu/RLVR-Env-Retrieval-Source-Retrieval-Synthetic-NVDocs-v1
Resource description:
---
license: cc-by-4.0
task_categories:
- text-retrieval
- question-answering
language:
- en
tags:
- retrieval
- rlvr
- search
- distractor-mining
size_categories:
- 100K<n<1M
---
# RLVR-Env-Retrieval-Source-Retrieval-Synthetic-NVDocs-v1
RLVR-ready retrieval environment derived from [nvidia/Retrieval-Synthetic-NVDocs-v1](https://huggingface.co/datasets/nvidia/Retrieval-Synthetic-NVDocs-v1).
**Author:** [Aman Priyanshu](https://huggingface.co/AmanPriyanshu)
## What Is This
A 100k-row retrieval QA dataset in which each row contains a question, its ground-truth chunks, and pre-mined distractor chunks (random and semantically similar). It is designed for training and evaluating retrieval agents in an RLVR (Reinforcement Learning with Verifiable Rewards) setup, where the agent searches through the distractors to find the correct chunk(s).
**Domain:** NVIDIA technical documentation (product specs, driver guides, research papers, developer docs)
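
Because the ground-truth chunks are known for every question, a reward can be computed mechanically from what the agent returns. The card does not prescribe a reward function, so the snippet below is only one possible sketch: it scores an episode by the F1 overlap between the selected chunk texts and `gt_chunks`. The function name `retrieval_reward` and the toy chunk strings are hypothetical.

```python
def retrieval_reward(selected, gt_chunks):
    """F1 overlap between the chunks an agent selected and the ground truth.

    This is an assumed reward, not one defined by the dataset card.
    """
    selected_set, gt_set = set(selected), set(gt_chunks)
    if not selected_set or not gt_set:
        return 0.0
    hits = len(selected_set & gt_set)
    if hits == 0:
        return 0.0
    precision = hits / len(selected_set)
    recall = hits / len(gt_set)
    return 2 * precision * recall / (precision + recall)


# Toy episode: two ground-truth chunks; the agent returns one hit and one miss.
gt = ["chunk A", "chunk B"]
reward = retrieval_reward(["chunk A", "distractor X"], gt)
```

Exact string matching works here because the distractors and ground-truth chunks are stored as full texts; a recall-only or precision-only reward would be an equally reasonable choice depending on the training goal.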
## Source
Derived from [nvidia/Retrieval-Synthetic-NVDocs-v1](https://huggingface.co/datasets/nvidia/Retrieval-Synthetic-NVDocs-v1) (15,095 documents / 105,665 QA pairs).
Original license: **CC BY 4.0** — retained here.
## Schema
### qa.parquet (100,000 rows)
| Column | Type | Description |
|---|---|---|
| `qa_id` | string | Unique ID (`nvdocs_0`, `nvdocs_1`, ...) |
| `question` | string | The retrieval query |
| `gt_chunks` | JSON string | List of ground-truth chunk texts: 1-10 document chunks per question (avg 2.2), mapped via `segment_ids` from the source QA pairs |
| `random_chunks` | JSON string | List of random distractor texts: ~495 chunks drawn from other documents (>=10 words each, deduplicated against the ground-truth and similar chunks) |
| `similar_chunks` | JSON string | List of hard-negative distractor texts: ~98 semantically similar chunks selected by MiniLM cosine similarity (below a 0.97 threshold), excluding chunks from the same document |
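
The card does not publish the mining code, so the following is only a sketch of how hard negatives of this kind could be selected. It assumes a single anchor embedding (the card does not say whether similarity is measured against the question or the ground-truth chunks), uses tiny 2-D vectors in place of real MiniLM embeddings, and treats the 0.97 threshold as a cap that drops near-duplicates. The function names and the `(text, doc_id, vector)` pool layout are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_hard_negatives(anchor_vec, anchor_doc, pool, k=98, max_sim=0.97):
    """Return up to k chunk texts most similar to the anchor.

    pool: list of (text, doc_id, vector) tuples (hypothetical layout).
    Skips chunks from the anchor's document and near-duplicates
    whose similarity is at or above max_sim.
    """
    scored = []
    for text, doc_id, vec in pool:
        if doc_id == anchor_doc:
            continue
        sim = cosine(anchor_vec, vec)
        if sim < max_sim:
            scored.append((sim, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy pool; the vectors are made up for illustration.
pool = [
    ("same-document chunk", "doc0", [1.0, 0.0]),
    ("near duplicate", "doc1", [1.0, 0.01]),
    ("related chunk", "doc2", [0.8, 0.6]),
    ("unrelated chunk", "doc3", [0.0, 1.0]),
]
negatives = mine_hard_negatives([1.0, 0.0], "doc0", pool, k=2)
```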
### metadata.parquet (100,000 rows)
| Column | Type | Description |
|---|---|---|
| `qa_id` | string | Matches qa.parquet |
| ... | ... | Remaining columns: `ground_truth_answer`, `query_type`, `reasoning_type`, `question_complexity`, `hop_count`, `segment_ids` |
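
Since both files share the `qa_id` key, metadata can be attached to QA rows with a simple lookup. The sketch below assumes the rows have already been loaded as lists of dicts (for instance with pyarrow's `Table.to_pylist()`); the row values are made up, and treating `hop_count` as an integer is an assumption because the card does not list column types for this file.

```python
# Toy rows standing in for qa.parquet and metadata.parquet; values are made up.
qa_rows = [
    {"qa_id": "nvdocs_0", "question": "Example question 0?"},
    {"qa_id": "nvdocs_1", "question": "Example question 1?"},
]
meta_rows = [
    {"qa_id": "nvdocs_0", "hop_count": 1},
    {"qa_id": "nvdocs_1", "hop_count": 2},
]

# Index metadata by qa_id, then merge each QA row with its metadata row.
meta_by_id = {m["qa_id"]: m for m in meta_rows}
joined = [
    {**q, **meta_by_id[q["qa_id"]]}
    for q in qa_rows
    if q["qa_id"] in meta_by_id
]

# Example filter: keep questions whose (assumed integer) hop_count exceeds 1.
multi_hop = [r["qa_id"] for r in joined if r["hop_count"] > 1]
```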
### chunks.parquet
120,878 document chunks with MiniLM embeddings. The file is kept for reference and is not needed at inference time.
## Deduplication
Within each row, deduplication follows the priority gt > similar > random, so no chunk text appears in more than one column per row. Similar chunks are also deduplicated among themselves, and random chunks are filtered against both the gt and similar chunks.
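
These rules can be checked per row with a few set operations. The helper below is a minimal sketch, not part of the dataset tooling: `check_row_dedup` is a hypothetical name, and the two toy rows only mimic the JSON-string column layout described in the schema.

```python
import json


def check_row_dedup(row):
    """Return True if no chunk text is shared across columns and
    the similar chunks contain no internal duplicates."""
    gt = json.loads(row["gt_chunks"])
    similar = json.loads(row["similar_chunks"])
    random_ = json.loads(row["random_chunks"])
    gt_set, sim_set, rand_set = set(gt), set(similar), set(random_)
    no_overlap = not (gt_set & sim_set or gt_set & rand_set or sim_set & rand_set)
    similar_unique = len(similar) == len(sim_set)
    return no_overlap and similar_unique


# Toy rows: the first respects the rules, the second leaks a gt chunk.
clean = {
    "gt_chunks": json.dumps(["a"]),
    "similar_chunks": json.dumps(["b", "c"]),
    "random_chunks": json.dumps(["d"]),
}
leaky = dict(clean, random_chunks=json.dumps(["a"]))

clean_ok = check_row_dedup(clean)
leaky_ok = check_row_dedup(leaky)
```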
## How To Use
```python
import json

import pyarrow.parquet as pq

# Read the QA table and materialize its first row as a plain dict.
t = pq.read_table("qa.parquet")
row = {col: t.column(col)[0].as_py() for col in t.column_names}

# The chunk columns are JSON-encoded lists of strings.
gt = json.loads(row["gt_chunks"])
distractors = json.loads(row["random_chunks"]) + json.loads(row["similar_chunks"])
```
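
For an actual episode, the ground-truth chunks usually need to be hidden among the distractors. One possible way to do that, sketched below under the assumption that a row is a dict shaped like the one loaded above, is to shuffle everything into a single candidate list and remember which positions hold ground truth. `build_candidate_pool` and the toy row are hypothetical.

```python
import json
import random


def build_candidate_pool(row, seed=0):
    """Shuffle gt and distractor chunks into one list.

    Returns (candidates, gt_indices), where gt_indices is the set of
    positions in candidates that hold ground-truth chunks.
    """
    gt = json.loads(row["gt_chunks"])
    distractors = json.loads(row["random_chunks"]) + json.loads(row["similar_chunks"])
    labeled = [(text, True) for text in gt] + [(text, False) for text in distractors]
    random.Random(seed).shuffle(labeled)  # seeded for reproducible episodes
    candidates = [text for text, _ in labeled]
    gt_indices = {i for i, (_, is_gt) in enumerate(labeled) if is_gt}
    return candidates, gt_indices


# Toy row mimicking the qa.parquet layout; texts are made up.
toy_row = {
    "gt_chunks": json.dumps(["gt text"]),
    "random_chunks": json.dumps(["random 1", "random 2"]),
    "similar_chunks": json.dumps(["similar 1"]),
}
candidates, gt_indices = build_candidate_pool(toy_row, seed=0)
```

Keeping the labels as indices rather than texts lets the environment verify an agent's answer without exposing which candidates are correct.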
## License
CC BY 4.0 (inherited from source dataset).
Provider:
AmanPriyanshu



