castorini/NanoKnow_Benchmark

Name: castorini/NanoKnow_Benchmark
Creator: castorini
Published: 2026-02-26 01:49:21
License: Apache-2.0

Source: Hugging Face. Updated 2026-02-26; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/castorini/NanoKnow_Benchmark


Description:

---
license: apache-2.0
task_categories:
- question-answering
tags:
- nanoknow
- qrels
- nanochat
- fineweb
- knowledge-probing
- parametric-knowledge
arxiv: "2602.20122"
size_categories:
- 10K<n<100K
---

# NanoKnow Benchmark Qrels

[[Paper]](https://arxiv.org/abs/2602.20122) [[Code]](https://github.com/castorini/NanoKnow)

Pre-built **relevance judgments (qrels)** that partition [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) and [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) into **supported** and **unsupported** splits based on whether the answer appears in the [nanochat](https://github.com/karpathy/nanochat) pre-training corpus ([karpathy/fineweb-edu-100b-shuffle](https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle)).

These qrels are part of the **NanoKnow** project: [github.com/castorini/NanoKnow](https://github.com/castorini/NanoKnow)

## Splits

| Dataset | Total Questions | Supported | Unsupported |
|---------|----------------|-----------|-------------|
| SQuAD | 10,570 | 7,560 (72%) | 3,010 (28%) |
| NQ-Open | 3,610 | 2,391 (66%) | 1,219 (34%) |

- **Supported**: The gold answer was found in the pre-training corpus and verified by an LLM judge. These questions test *parametric knowledge*.
- **Unsupported**: The gold answer does not appear in the pre-training corpus. These questions test the model's ability to generalize or rely on *external knowledge* (RAG).
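
One way the two splits might be used is to score a model separately on supported and unsupported questions. The sketch below is only an illustration under stated assumptions: `exact_match` and `accuracy_by_split` are hypothetical helpers (not part of the NanoKnow code), the normalization is a simplification rather than the official SQuAD or NQ scoring, and the toy data is invented.

```python
def exact_match(prediction: str, gold: str) -> bool:
    """Compare two answers after lowercasing and collapsing whitespace."""
    # Simplified normalization: an assumption, not the official scoring script.
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return normalize(prediction) == normalize(gold)


def accuracy_by_split(predictions, gold_answers, supported_qids):
    """Return exact-match accuracy for supported and unsupported questions."""
    hits = {"supported": 0, "unsupported": 0}
    counts = {"supported": 0, "unsupported": 0}
    for qid, gold in gold_answers.items():
        split = "supported" if qid in supported_qids else "unsupported"
        counts[split] += 1
        # A missing prediction is scored as an empty string (a miss).
        hits[split] += exact_match(predictions.get(qid, ""), gold)
    return {s: (hits[s] / counts[s] if counts[s] else 0.0) for s in hits}


# Toy data, invented for illustration only.
gold = {1: "Paris", 2: "1969", 3: "Marie Curie"}
preds = {1: "paris", 2: "1970", 3: "Marie  Curie"}
print(accuracy_by_split(preds, gold, supported_qids={1, 2}))
```

Reporting the two numbers side by side separates what a model recalls from pre-training from what it would need retrieval or generalization to answer.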
## Files

| File | Description | Format |
|------|-------------|--------|
| `qrels/squad_supported.txt` | SQuAD supported questions (7,560 questions, 145,918 verified docs) | `qid, question, answer, doc_id, answer_offset` |
| `qrels/squad_unsupported.txt` | SQuAD unsupported questions (3,010 questions) | `qid, question, answer` |
| `qrels/nq_supported.txt` | NQ supported questions (2,391 questions, 56,857 verified docs) | `qid, question, answer, doc_id, answer_offset` |
| `qrels/nq_unsupported.txt` | NQ unsupported questions (1,219 questions) | `qid, question, answer` |

## File Format

**Supported qrels** map each question to one or more pre-training documents that contain a verified answer:

```
qid, question, official_answer, doc_id, answer_offset
```

- `doc_id`: Document identifier in the format `shard_XXXXX_YYYYY` (shard number and row offset within the FineWeb-Edu parquet files).
- `answer_offset`: Character offset of the answer string within the document.

**Unsupported qrels** list questions whose answers were not found in the corpus:

```
qid, question, official_answer
```

## Pipeline

The qrels were generated using a three-stage pipeline:

1. **BM25 Retrieval**: Search the corpus for the top-100 candidate documents per question using [Pyserini](https://github.com/castorini/pyserini).
2. **Answer String Matching**: Filter to documents containing the gold answer as a substring.
3. **LLM Verification**: Use Qwen/Qwen3-8B as a judge to filter out coincidental matches (e.g., "Paris" in a passage about Paris, Texas).
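
The answer string matching stage and the `doc_id` convention can be sketched as follows. This is a minimal illustration, not the project's actual implementation: `parse_doc_id` and `filter_by_answer` are hypothetical names, the candidate documents are invented, and the choices of case-sensitive matching, 0-based offsets, and first-occurrence offsets are assumptions.

```python
import re

# Matches identifiers of the form shard_XXXXX_YYYYY described above.
DOC_ID_PATTERN = re.compile(r"^shard_(\d+)_(\d+)$")


def parse_doc_id(doc_id: str) -> tuple[int, int]:
    """Split a doc_id into (shard number, row offset)."""
    match = DOC_ID_PATTERN.match(doc_id)
    if match is None:
        raise ValueError(f"unexpected doc_id format: {doc_id!r}")
    return int(match.group(1)), int(match.group(2))


def filter_by_answer(candidates: dict[str, str], answer: str) -> list[tuple[str, int]]:
    """Keep (doc_id, offset) pairs for candidates whose text contains the answer."""
    matches = []
    for doc_id, text in candidates.items():
        # Assumption: case-sensitive substring match, first occurrence only.
        offset = text.find(answer)
        if offset != -1:
            matches.append((doc_id, offset))
    return matches


# Invented stand-ins for BM25 candidates (stage 1 output).
candidates = {
    "shard_00000_00001": "The capital of France is Paris.",
    "shard_00003_00042": "Lyon is a city in France.",
}
matches = filter_by_answer(candidates, "Paris")
print(matches)
print(parse_doc_id(matches[0][0]))
```

A substring match alone cannot tell a genuine answer from a coincidental one, which is why the pipeline follows this stage with LLM verification.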
## Usage

### Download

```bash
huggingface-cli download castorini/NanoKnow_Benchmark --repo-type dataset --local-dir ./nanoknow-benchmark
```

### Load in Python

```python
def load_supported_qrels(filepath):
    """Load supported qrels: qid, question, answer, doc_id, answer_offset."""
    qrels = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            # Skip comment lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            # Naive comma split: assumes no field itself contains a comma.
            parts = [p.strip() for p in line.split(",")]
            qrels.append({
                "qid": int(parts[0]),
                "question": parts[1],
                "answer": parts[2],
                "doc_id": parts[3],
                "answer_offset": int(parts[4]),
            })
    return qrels


squad_supported = load_supported_qrels("nanoknow-benchmark/qrels/squad_supported.txt")
print(f"Loaded {len(squad_supported)} supported entries")
```

## Related Resources

- **Lucene Index**: [LingweiGu/NanoKnow-Fineweb-Edu-Index](https://huggingface.co/datasets/LingweiGu/NanoKnow-Fineweb-Edu-Index). The pre-built BM25 index over the FineWeb-Edu corpus (~326 GB) used to generate these qrels.
- **Code**: [github.com/castorini/NanoKnow](https://github.com/castorini/NanoKnow). Scripts to project new benchmarks, evaluate nanochat checkpoints, and analyze frequency effects.

## Citation

```bibtex
@article{gu2026nanoknow,
  title={NanoKnow: How to Know What Your Language Model Knows},
  author={Gu, Lingwei and Jedidi, Nour and Lin, Jimmy},
  journal={arXiv preprint arXiv:2602.20122},
  year={2026}
}
```

## License

Apache 2.0


Provider:

castorini
