
castorini/NanoKnow_Benchmark

Hugging Face | Updated 2026-02-26 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/castorini/NanoKnow_Benchmark
---
license: apache-2.0
task_categories:
- question-answering
tags:
- nanoknow
- qrels
- nanochat
- fineweb
- knowledge-probing
- parametric-knowledge
arxiv: "2602.20122"
size_categories:
- 10K<n<100K
---

# NanoKnow Benchmark Qrels

[[Paper]](https://arxiv.org/abs/2602.20122) [[Code]](https://github.com/castorini/NanoKnow)

Pre-built **relevance judgments (qrels)** that partition [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) and [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) into **supported** and **unsupported** splits based on whether the answer appears in the [nanochat](https://github.com/karpathy/nanochat) pre-training corpus ([karpathy/fineweb-edu-100b-shuffle](https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle)).

These qrels are part of the **NanoKnow** project: [github.com/castorini/NanoKnow](https://github.com/castorini/NanoKnow)

## Splits

| Dataset | Total Questions | Supported | Unsupported |
|---------|----------------|-----------|-------------|
| SQuAD | 10,570 | 7,560 (72%) | 3,010 (28%) |
| NQ-Open | 3,610 | 2,391 (66%) | 1,219 (34%) |

- **Supported**: The gold answer was found in the pre-training corpus and verified by an LLM judge. These questions test *parametric knowledge*.
- **Unsupported**: The gold answer does not appear in the pre-training corpus. These questions test the model's ability to generalize or rely on *external knowledge* (RAG).
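The figures in the Splits table can be sanity-checked with a few lines of Python. This is only an arithmetic check on the numbers quoted above; the `splits` dictionary and the `split_percentages` helper are hypothetical names introduced for illustration and are not part of the released files.

```python
# Counts copied from the Splits table; the dict layout is illustrative only.
splits = {
    "SQuAD": {"total": 10570, "supported": 7560, "unsupported": 3010},
    "NQ-Open": {"total": 3610, "supported": 2391, "unsupported": 1219},
}


def split_percentages(counts):
    """Return (supported %, unsupported %) rounded to whole percents."""
    total = counts["total"]
    return (
        round(100 * counts["supported"] / total),
        round(100 * counts["unsupported"] / total),
    )


for name, counts in splits.items():
    # The two splits should partition the full question set.
    assert counts["supported"] + counts["unsupported"] == counts["total"]
    sup_pct, unsup_pct = split_percentages(counts)
    print(f"{name}: {sup_pct}% supported, {unsup_pct}% unsupported")
```

Running it prints `SQuAD: 72% supported, 28% unsupported` and `NQ-Open: 66% supported, 34% unsupported`, matching the table.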
## Files

| File | Description | Format |
|------|-------------|--------|
| `qrels/squad_supported.txt` | SQuAD supported questions (7,560 questions, 145,918 verified docs) | `qid, question, answer, doc_id, answer_offset` |
| `qrels/squad_unsupported.txt` | SQuAD unsupported questions (3,010 questions) | `qid, question, answer` |
| `qrels/nq_supported.txt` | NQ supported questions (2,391 questions, 56,857 verified docs) | `qid, question, answer, doc_id, answer_offset` |
| `qrels/nq_unsupported.txt` | NQ unsupported questions (1,219 questions) | `qid, question, answer` |

## File Format

**Supported qrels** map each question to one or more pre-training documents that contain a verified answer:

```
qid, question, official_answer, doc_id, answer_offset
```

- `doc_id`: Document identifier in the format `shard_XXXXX_YYYYY` (shard number and row offset within the FineWeb-Edu parquet files).
- `answer_offset`: Character offset of the answer string within the document.

**Unsupported qrels** list questions whose answers were not found in the corpus:

```
qid, question, official_answer
```

## Pipeline

The qrels were generated using a three-stage pipeline:

1. **BM25 Retrieval**: Search the corpus for the top-100 candidate documents per question using [Pyserini](https://github.com/castorini/pyserini).
2. **Answer String Matching**: Filter to documents containing the gold answer as a substring.
3. **LLM Verification**: Use Qwen/Qwen3-8B as a judge to filter out coincidental matches (e.g., "Paris" in a passage about Paris, Texas).
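Stage 2 of the pipeline, together with the `answer_offset` field, can be illustrated with a minimal sketch. It assumes exact, case-sensitive substring matching over in-memory strings; the function names `find_answer_offsets` and `filter_candidates` and the toy documents are hypothetical and are not taken from the NanoKnow code, whose actual normalization rules may differ.

```python
def find_answer_offsets(document, answer):
    """Return every character offset at which `answer` occurs in `document`.

    Assumption: exact, case-sensitive matching with no text normalization.
    """
    if not answer:
        return []
    offsets = []
    start = document.find(answer)
    while start != -1:
        offsets.append(start)
        start = document.find(answer, start + 1)
    return offsets


def filter_candidates(candidates, answer):
    """Keep only the candidate documents that contain the answer string.

    `candidates` maps doc_id -> document text (for example, stage 1 BM25 hits).
    Returns a dict mapping doc_id -> list of answer offsets.
    """
    matches = {}
    for doc_id, text in candidates.items():
        offsets = find_answer_offsets(text, answer)
        if offsets:
            matches[doc_id] = offsets
    return matches


# Toy data: the doc_ids follow the shard_XXXXX_YYYYY pattern but are made up.
candidates = {
    "shard_00000_00001": "Paris is the capital of France.",
    "shard_00000_00002": "Berlin is the capital of Germany.",
    "shard_00001_00007": "The town of Paris, Texas, hosts a replica tower.",
}
print(filter_candidates(candidates, "Paris"))
```

Both the France and the Texas passages survive this filter, which is the kind of coincidental match that the LLM verification stage is meant to remove.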
## Usage

### Download

```bash
huggingface-cli download LingweiGu/NanoKnow_Benchmark --repo-type dataset --local-dir ./nanoknow-benchmark
```

### Load in Python

```python
def load_supported_qrels(filepath):
    """Load a supported-qrels file as a list of dicts, one per (question, document) row."""
    qrels = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            # Skip comment lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            # Fields: qid, question, answer, doc_id, answer_offset.
            # Note: a plain split assumes that no field contains a comma.
            parts = [p.strip() for p in line.split(",")]
            qrels.append({
                "qid": int(parts[0]),
                "question": parts[1],
                "answer": parts[2],
                "doc_id": parts[3],
                "answer_offset": int(parts[4]),
            })
    return qrels

squad_supported = load_supported_qrels("nanoknow-benchmark/qrels/squad_supported.txt")
print(f"Loaded {len(squad_supported)} supported entries")
```

## Related Resources

- **Lucene Index**: [LingweiGu/NanoKnow-Fineweb-Edu-Index](https://huggingface.co/datasets/LingweiGu/NanoKnow-Fineweb-Edu-Index). The pre-built BM25 index over the FineWeb-Edu corpus (~326 GB) used to generate these qrels.
- **Code**: [github.com/castorini/NanoKnow](https://github.com/castorini/NanoKnow). Scripts to project new benchmarks, evaluate nanochat checkpoints, and analyze frequency effects.

## Citation

```bibtex
@article{gu2026nanoknow,
  title={NanoKnow: How to Know What Your Language Model Knows},
  author={Gu, Lingwei and Jedidi, Nour and Lin, Jimmy},
  journal={arXiv preprint arXiv:2602.20122},
  year={2026}
}
```

## License

Apache 2.0
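As a follow-up to the Usage section: the supported files hold several rows per question (145,918 verified documents for 7,560 SQuAD questions), so it is often convenient to group rows by `qid` and to split each `doc_id` into its shard and row parts. The sketch below assumes that `XXXXX` and `YYYYY` are zero-padded decimal integers; `parse_doc_id`, `group_by_question`, and the sample rows are hypothetical and only mirror the dict shape produced by `load_supported_qrels`.

```python
from collections import defaultdict


def parse_doc_id(doc_id):
    """Split 'shard_XXXXX_YYYYY' into (shard number, row offset) as integers.

    Assumption: both parts are decimal integers, possibly zero-padded.
    """
    prefix, shard, row = doc_id.split("_")
    if prefix != "shard":
        raise ValueError(f"unexpected doc_id: {doc_id!r}")
    return int(shard), int(row)


def group_by_question(rows):
    """Map each qid to a list of (shard, row, answer_offset) tuples."""
    grouped = defaultdict(list)
    for row in rows:
        shard, row_offset = parse_doc_id(row["doc_id"])
        grouped[row["qid"]].append((shard, row_offset, row["answer_offset"]))
    return dict(grouped)


# Made-up rows in the same shape as the loader's output.
rows = [
    {"qid": 0, "question": "q0", "answer": "a0",
     "doc_id": "shard_00003_00042", "answer_offset": 17},
    {"qid": 0, "question": "q0", "answer": "a0",
     "doc_id": "shard_00010_00005", "answer_offset": 230},
    {"qid": 1, "question": "q1", "answer": "a1",
     "doc_id": "shard_00003_00042", "answer_offset": 98},
]

grouped = group_by_question(rows)
for qid, docs in grouped.items():
    print(f"qid {qid}: {len(docs)} supporting document(s) -> {docs}")
```

Grouping this way gives a per-question count of supporting documents, which is the quantity a frequency analysis would start from.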

Provided by:
castorini