AmanPriyanshu/RLVR-Env-Retrieval-Source-Retrieval-Synthetic-NVDocs-v1

Source: Hugging Face. Updated 2026-03-14; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/AmanPriyanshu/RLVR-Env-Retrieval-Source-Retrieval-Synthetic-NVDocs-v1
Resource description:
---
license: cc-by-4.0
task_categories:
- text-retrieval
- question-answering
language:
- en
tags:
- retrieval
- rlvr
- search
- distractor-mining
size_categories:
- 100K<n<1M
---

# RLVR-Env-Retrieval-Source-Retrieval-Synthetic-NVDocs-v1

RLVR-ready retrieval environment derived from [nvidia/Retrieval-Synthetic-NVDocs-v1](https://huggingface.co/datasets/nvidia/Retrieval-Synthetic-NVDocs-v1).

**Author:** [Aman Priyanshu](https://huggingface.co/AmanPriyanshu)

## What Is This

A 100k-row retrieval QA dataset where each row contains a question, ground-truth chunks, and pre-mined distractor chunks (random + semantically similar). Designed for training and evaluating retrieval agents in an RLVR (Reinforcement Learning with Verifiable Rewards) setup: the agent searches through distractors to find the correct chunk(s).

**Domain:** NVIDIA technical documentation (product specs, driver guides, research papers, developer docs)

## Source

Derived from [nvidia/Retrieval-Synthetic-NVDocs-v1](https://huggingface.co/datasets/nvidia/Retrieval-Synthetic-NVDocs-v1) (15,095 documents / 105,665 QA pairs). Original license: **CC BY 4.0**, retained here.

## Schema

### qa.parquet (100,000 rows)

| Column | Type | Description |
|---|---|---|
| `qa_id` | string | Unique ID (`nvdocs_0`, `nvdocs_1`, ...) |
| `question` | string | The retrieval query |
| `gt_chunks` | JSON string | List of ground-truth chunk texts. 1-10 document chunks per question (avg 2.2), mapped via segment_ids from source QA pairs |
| `random_chunks` | JSON string | List of random distractor texts. ~495 random chunks from other documents (>=10 words, deduplicated against gt and similar) |
| `similar_chunks` | JSON string | List of hard-negative distractor texts. ~98 semantically similar chunks via MiniLM cosine similarity (<0.97 threshold), excluding same-document chunks |

### metadata.parquet (100,000 rows)

| Column | Type | Description |
|---|---|---|
| `qa_id` | string | Matches qa.parquet |
| ... | ... | ground_truth_answer, query_type, reasoning_type, question_complexity, hop_count, segment_ids |

### chunks.parquet

120,878 document chunks with MiniLM embeddings. Kept for reference; not needed at inference time.

## Deduplication

Within each row: gt > similar > random priority. No chunk text appears in more than one column per row. Similar chunks are internally deduplicated. Random chunks are filtered against both gt and similar.

## How To Use

```python
import json
import pyarrow.parquet as pq

t = pq.read_table("qa.parquet")
row = {col: t.column(col)[0].as_py() for col in t.column_names}
gt = json.loads(row["gt_chunks"])
distractors = json.loads(row["random_chunks"]) + json.loads(row["similar_chunks"])
```

## License

CC BY 4.0 (inherited from source dataset).
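
## Sketch: Scoring a Retrieval Episode

The dataset card describes an RLVR setup in which an agent searches through distractors to find the ground-truth chunk(s), but it does not specify a reward function. The following is a minimal sketch of how one row could be turned into a shuffled candidate pool and scored. The helper names `build_episode` and `retrieval_reward`, the use of F1 as the reward, and the toy row are all assumptions made for illustration; they are not part of the dataset or the author's method. Only the column names (`gt_chunks`, `random_chunks`, `similar_chunks`) and their JSON-string encoding come from the schema above.

```python
import json
import random


def build_episode(row, seed=0):
    """Hypothetical helper: merge gt and distractor chunks into one shuffled pool.

    Returns the candidate list and the set of indices holding ground-truth chunks.
    """
    gt = json.loads(row["gt_chunks"])
    distractors = json.loads(row["random_chunks"]) + json.loads(row["similar_chunks"])
    candidates = gt + distractors
    # Shuffle so that position in the pool does not leak which chunks are correct.
    random.Random(seed).shuffle(candidates)
    gt_set = set(gt)
    gt_indices = {i for i, text in enumerate(candidates) if text in gt_set}
    return candidates, gt_indices


def retrieval_reward(selected, gt_indices):
    """Assumed reward: F1 between the agent's selected indices and the gt indices."""
    selected = set(selected)
    hits = len(selected & gt_indices)
    if hits == 0:
        return 0.0
    precision = hits / len(selected)
    recall = hits / len(gt_indices)
    return 2 * precision * recall / (precision + recall)


# Toy row shaped like qa.parquet (contents are made up for illustration).
row = {
    "gt_chunks": json.dumps(["gt chunk a", "gt chunk b"]),
    "random_chunks": json.dumps(["random 1", "random 2", "random 3"]),
    "similar_chunks": json.dumps(["similar 1", "similar 2"]),
}

candidates, gt_indices = build_episode(row, seed=0)
reward_all_correct = retrieval_reward(gt_indices, gt_indices)
reward_nothing_selected = retrieval_reward([], gt_indices)
```

Because the card states that no chunk text appears in more than one column per row, matching candidates back to ground truth by exact text is unambiguous here. Other verifiable rewards (recall only, or exact set match) would fit the same interface.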
Provider:
AmanPriyanshu