AmanPriyanshu/RLVR-Env-Retrieval-Source-code-search-net-python

Name: AmanPriyanshu/RLVR-Env-Retrieval-Source-code-search-net-python
Creator: AmanPriyanshu
Published: 2026-03-10 05:30:40
License: Apache 2.0

Source: Hugging Face. Updated 2026-03-10. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/AmanPriyanshu/RLVR-Env-Retrieval-Source-code-search-net-python


Description:

---
license: apache-2.0
task_categories:
- text-retrieval
- question-answering
language:
- en
tags:
- retrieval
- rlvr
- search
- distractor-mining
size_categories:
- 100K<n<1M
---

# RLVR-Env-Retrieval-Source-code-search-net-python

RLVR-ready retrieval environment derived from [Nan-Do/code-search-net-python](https://huggingface.co/datasets/Nan-Do/code-search-net-python).

**Author:** [Aman Priyanshu](https://huggingface.co/AmanPriyanshu)

## What Is This

A 100k-row retrieval QA dataset where each row contains a question, ground-truth chunks, and pre-mined distractor chunks (random + semantically similar). Designed for training and evaluating retrieval agents in an RLVR (Reinforcement Learning with Verifiable Rewards) setup: the agent searches through distractors to find the correct chunk(s).

**Domain:** Python open-source functions from GitHub (CodeSearchNet)

## Source

Derived from [Nan-Do/code-search-net-python](https://huggingface.co/datasets/Nan-Do/code-search-net-python) (455,243 unique functions). Original license: **Apache 2.0**, retained here.

## Schema

### qa.parquet (100,000 rows)

| Column | Type | Description |
|---|---|---|
| `qa_id` | string | Unique ID (`search_py_0`, `search_py_1`, ...) |
| `question` | string | The retrieval query |
| `gt_chunks` | JSON string | List of ground-truth chunk texts. 1 target code chunk per question (the function matching the summary) |
| `random_chunks` | JSON string | List of random distractor texts. ~500 random code chunks (>=20 chars, deduplicated against gt and similar) |
| `similar_chunks` | JSON string | List of hard-negative distractor texts. ~178 similar chunks via MiniLM cosine (<0.97) + char trigram edit-distance (<0.97 seq ratio), deduplicated |

### metadata.parquet (100,000 rows)

| Column | Type | Description |
|---|---|---|
| `qa_id` | string | Matches qa.parquet |
| ... | ... | chunk_idx, func_name, repo, char_count |

### chunks.parquet

455,243 code chunks with MiniLM embeddings. Kept for reference; not needed at inference time.

## Deduplication

Within each row: gt > similar > random priority. No chunk text appears in more than one column per row. Similar chunks are internally deduplicated. Random chunks are filtered against both gt and similar.

## How To Use

```python
import json
import pyarrow.parquet as pq

t = pq.read_table("qa.parquet")
row = {col: t.column(col)[0].as_py() for col in t.column_names}
gt = json.loads(row["gt_chunks"])
distractors = json.loads(row["random_chunks"]) + json.loads(row["similar_chunks"])
```

## License

Apache 2.0 (inherited from source dataset).
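As a rough illustration of the retrieval setup described in the dataset card, the sketch below merges the three chunk columns of one row into a single shuffled candidate pool and scores a selection against the ground truth. The function names `build_candidate_pool` and `retrieval_reward`, the `seed` parameter, the toy row, and the exact-match binary reward are all assumptions made for illustration; the dataset card does not prescribe a reward function or a pool layout.

```python
import json
import random


def build_candidate_pool(row, seed=0):
    """Merge ground-truth and distractor chunks into one shuffled pool.

    `row` is a dict with the JSON-string columns from qa.parquet.
    Returns (pool, gt_indices): the shuffled chunk texts and the set of
    pool positions that hold ground-truth chunks.
    """
    gt = json.loads(row["gt_chunks"])
    distractors = json.loads(row["random_chunks"]) + json.loads(row["similar_chunks"])
    # Tag each chunk so the ground truth can be located after shuffling.
    labeled = [(text, True) for text in gt] + [(text, False) for text in distractors]
    random.Random(seed).shuffle(labeled)
    pool = [text for text, _ in labeled]
    gt_indices = {i for i, (_, is_gt) in enumerate(labeled) if is_gt}
    return pool, gt_indices


def retrieval_reward(selected_indices, gt_indices):
    """Hypothetical binary reward: 1.0 only if the selection equals the ground truth."""
    return 1.0 if set(selected_indices) == set(gt_indices) else 0.0


# Toy row standing in for one record of qa.parquet (contents are made up).
toy_row = {
    "gt_chunks": json.dumps(["def add(a, b):\n    return a + b"]),
    "random_chunks": json.dumps(["def noop():\n    pass", "def hello():\n    print('hi')"]),
    "similar_chunks": json.dumps(["def sub(a, b):\n    return a - b"]),
}

pool, gt_indices = build_candidate_pool(toy_row, seed=0)
print(len(pool), retrieval_reward(gt_indices, gt_indices))
```

Because the deduplication rules guarantee that no chunk text appears in more than one column of a row, tracking ground truth by position in the pool is unambiguous. A graded reward (for example, partial credit per correct chunk) could replace the binary one if rows with several ground-truth chunks are used.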

Provided by:

AmanPriyanshu
