castorini/NanoKnow_Benchmark
Source: Hugging Face. Updated 2026-02-26; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/castorini/NanoKnow_Benchmark
---
license: apache-2.0
task_categories:
- question-answering
tags:
- nanoknow
- qrels
- nanochat
- fineweb
- knowledge-probing
- parametric-knowledge
arxiv: "2602.20122"
size_categories:
- 10K<n<100K
---
# NanoKnow Benchmark Qrels
[[Paper]](https://arxiv.org/abs/2602.20122) [[Code]](https://github.com/castorini/NanoKnow)
Pre-built **relevance judgments (qrels)** that partition [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) and [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) into **supported** and **unsupported** splits based on whether the answer appears in the [nanochat](https://github.com/karpathy/nanochat) pre-training corpus ([karpathy/fineweb-edu-100b-shuffle](https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle)).
These qrels are part of the **NanoKnow** project: [github.com/castorini/NanoKnow](https://github.com/castorini/NanoKnow)
## Splits
| Dataset | Total Questions | Supported | Unsupported |
|---------|----------------|-----------|-------------|
| SQuAD | 10,570 | 7,560 (72%) | 3,010 (28%) |
| NQ-Open | 3,610 | 2,391 (66%) | 1,219 (34%) |
- **Supported** — The gold answer was found in the pre-training corpus and verified by an LLM judge. These questions test *parametric knowledge*.
- **Unsupported** — The gold answer does not appear in the pre-training corpus. These questions test the model's ability to generalize or rely on *external knowledge* (RAG).
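One way to use the two splits is to score a model separately on each. The sketch below is illustrative only: the function names, the case-insensitive exact-match metric, and the toy data are assumptions and are not taken from the NanoKnow code.

```python
def exact_match(prediction: str, gold: str) -> bool:
    """Case-insensitive exact match after trimming whitespace (a simplification)."""
    return prediction.strip().lower() == gold.strip().lower()


def accuracy_by_split(predictions, gold_answers, supported_qids):
    """Return accuracy on the supported and unsupported questions separately."""
    hits = {"supported": 0, "unsupported": 0}
    totals = {"supported": 0, "unsupported": 0}
    for qid, gold in gold_answers.items():
        split = "supported" if qid in supported_qids else "unsupported"
        totals[split] += 1
        # A missing prediction counts as a wrong answer.
        if exact_match(predictions.get(qid, ""), gold):
            hits[split] += 1
    return {s: (hits[s] / totals[s] if totals[s] else 0.0) for s in totals}


# Toy data, invented purely for illustration.
gold = {"q1": "Paris", "q2": "1969", "q3": "Ada Lovelace"}
preds = {"q1": "paris", "q2": "1968", "q3": "Ada Lovelace"}
supported = {"q1", "q2"}

scores = accuracy_by_split(preds, gold, supported)
print(scores)
```

Reporting the two numbers side by side separates what a checkpoint can answer from its pre-training data from what it would need retrieval for.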
## Files
| File | Description | Format |
|------|-------------|--------|
| `qrels/squad_supported.txt` | SQuAD supported questions (7,560 questions, 145,918 verified docs) | `qid, question, answer, doc_id, answer_offset` |
| `qrels/squad_unsupported.txt` | SQuAD unsupported questions (3,010 questions) | `qid, question, answer` |
| `qrels/nq_supported.txt` | NQ supported questions (2,391 questions, 56,857 verified docs) | `qid, question, answer, doc_id, answer_offset` |
| `qrels/nq_unsupported.txt` | NQ unsupported questions (1,219 questions) | `qid, question, answer` |
## File Format
**Supported qrels** map each question to one or more pre-training documents that contain a verified answer:
```
qid, question, official_answer, doc_id, answer_offset
```
- `doc_id`: Document identifier in the format `shard_XXXXX_YYYYY` (shard number and row offset within the FineWeb-Edu parquet files).
- `answer_offset`: Character offset of the answer string within the document.
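The two fields above can be decoded with a few lines of Python. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the offset is assumed to be 0-based and to point at the first character of the answer, and the comparison is assumed to be case-insensitive. Check these against the actual files before relying on them.

```python
import re

# Matches identifiers of the form shard_XXXXX_YYYYY described above.
DOC_ID_PATTERN = re.compile(r"^shard_(\d+)_(\d+)$")


def parse_doc_id(doc_id: str) -> tuple[int, int]:
    """Split a doc_id into (shard number, row offset within the shard)."""
    match = DOC_ID_PATTERN.match(doc_id)
    if match is None:
        raise ValueError(f"Unexpected doc_id format: {doc_id!r}")
    return int(match.group(1)), int(match.group(2))


def answer_at_offset(text: str, answer: str, offset: int) -> bool:
    """Check that `answer` starts at character `offset` of `text` (assumed 0-based)."""
    window = text[offset:offset + len(answer)]
    return window.lower() == answer.lower()


shard, row = parse_doc_id("shard_00012_00345")
print(shard, row)

# Toy document text, invented for illustration.
document = "The capital of France is Paris."
print(answer_at_offset(document, "Paris", document.index("Paris")))
```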
**Unsupported qrels** list questions whose answers were not found in the corpus:
```
qid, question, official_answer
```
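A loader for the three-field layout might look like the sketch below. It assumes the first comma ends the `qid` and the last comma starts the answer, so a question may contain commas but an answer may not; that parsing rule, the function names, and the sample line are assumptions, not part of the dataset documentation. The `qid` is kept as a string to avoid guessing its type.

```python
import io


def parse_unsupported_line(line: str) -> dict:
    """Parse one `qid, question, official_answer` line.

    Assumes the answer contains no comma, so commas inside the
    question are preserved by splitting from both ends.
    """
    qid, rest = line.split(",", 1)
    question, answer = rest.rsplit(",", 1)
    return {"qid": qid.strip(), "question": question.strip(), "answer": answer.strip()}


def load_unsupported_qrels(lines) -> list:
    """Parse an iterable of lines, skipping comments and blank lines."""
    return [
        parse_unsupported_line(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    ]


# In-memory sample standing in for a qrels file; the content is invented.
sample = io.StringIO(
    "# qid, question, official_answer\n"
    "\n"
    "42, In what year, roughly, did it open?, 1998\n"
)
rows = load_unsupported_qrels(sample)
print(rows)
```

To read a real file, pass an open file handle instead of the `StringIO` object.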
## Pipeline
The qrels were generated using a three-stage pipeline:
1. **BM25 Retrieval** — Search the corpus for the top-100 candidate documents per question using [Pyserini](https://github.com/castorini/pyserini).
2. **Answer String Matching** — Filter to documents containing the gold answer as a substring.
3. **LLM Verification** — Use Qwen/Qwen3-8B as a judge to filter out coincidental matches (e.g., "Paris" in a passage about Paris, Texas).
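The control flow of the three stages can be sketched as follows. This is not the project's implementation: `retrieve` and `judge` are hypothetical callables standing in for the Pyserini BM25 search and the Qwen3-8B judge, and the case-insensitive substring test is an assumption about how matching is done.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (doc_id, document text)


def contains_answer(text: str, answer: str) -> bool:
    """Stage 2: substring match (case-insensitive here, which is an assumption)."""
    return answer.lower() in text.lower()


def verified_docs(
    question: str,
    answer: str,
    retrieve: Callable[[str, int], List[Candidate]],
    judge: Callable[[str, str, str], bool],
    k: int = 100,
) -> List[str]:
    """Return doc_ids that survive retrieval, string matching, and the judge."""
    candidates = retrieve(question, k)  # stage 1: top-k candidates
    matched = [(d, t) for d, t in candidates if contains_answer(t, answer)]  # stage 2
    return [d for d, t in matched if judge(question, answer, t)]  # stage 3


# Toy stand-ins, invented for illustration only.
TOY_CORPUS = [
    ("d1", "Paris is the capital of France."),
    ("d2", "Paris, Texas is a city in Lamar County."),
    ("d3", "Berlin is the capital of Germany."),
]


def toy_retrieve(question: str, k: int) -> List[Candidate]:
    return TOY_CORPUS[:k]


def toy_judge(question: str, answer: str, text: str) -> bool:
    # A real judge would be an LLM call; this rule only mimics the Paris, Texas case.
    return "France" in text


docs = verified_docs("What is the capital of France?", "Paris", toy_retrieve, toy_judge)
print(docs)
```

Under this sketch a question counts as supported when the returned list is non-empty, and as unsupported otherwise.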
## Usage
### Download
```bash
huggingface-cli download castorini/NanoKnow_Benchmark --repo-type dataset --local-dir ./nanoknow-benchmark
```
### Load in Python
```python
def load_supported_qrels(filepath):
    """Load supported qrels as a list of dicts, one per (question, document) pair."""
    qrels = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            # Skip comment lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            # Naive comma split: assumes no field contains a comma.
            parts = [p.strip() for p in line.split(",")]
            qrels.append({
                "qid": int(parts[0]),
                "question": parts[1],
                "answer": parts[2],
                "doc_id": parts[3],
                "answer_offset": int(parts[4]),
            })
    return qrels


squad_supported = load_supported_qrels("nanoknow-benchmark/qrels/squad_supported.txt")
print(f"Loaded {len(squad_supported)} supported entries")
```
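Because a supported question can map to several documents, the number of loaded entries is the number of (question, document) pairs, not the number of questions. A small sketch of grouping entries by `qid` follows; the helper name and the toy rows are hypothetical, and the rows only mimic the dict layout produced by the loader above.

```python
from collections import defaultdict


def group_docs_by_qid(qrels):
    """Map each qid to its list of (doc_id, answer_offset) pairs."""
    docs = defaultdict(list)
    for row in qrels:
        docs[row["qid"]].append((row["doc_id"], row["answer_offset"]))
    return dict(docs)


# Toy rows in the same shape as the loader output; values are invented.
rows = [
    {"qid": 1, "question": "q one", "answer": "a", "doc_id": "shard_00000_00001", "answer_offset": 10},
    {"qid": 1, "question": "q one", "answer": "a", "doc_id": "shard_00003_00042", "answer_offset": 250},
    {"qid": 2, "question": "q two", "answer": "b", "doc_id": "shard_00001_00007", "answer_offset": 5},
]

by_qid = group_docs_by_qid(rows)
num_questions = len(by_qid)
num_pairs = sum(len(docs) for docs in by_qid.values())
print(f"{num_questions} questions, {num_pairs} verified documents")
```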
## Related Resources
- **Lucene Index**: [LingweiGu/NanoKnow-Fineweb-Edu-Index](https://huggingface.co/datasets/LingweiGu/NanoKnow-Fineweb-Edu-Index) — The pre-built BM25 index over the FineWeb-Edu corpus (~326 GB) used to generate these qrels.
- **Code**: [github.com/castorini/NanoKnow](https://github.com/castorini/NanoKnow) — Scripts to project new benchmarks, evaluate nanochat checkpoints, and analyze frequency effects.
## Citation
```bibtex
@article{gu2026nanoknow,
title={NanoKnow: How to Know What Your Language Model Knows},
author={Gu, Lingwei and Jedidi, Nour and Lin, Jimmy},
journal={arXiv preprint arXiv:2602.20122},
year={2026}
}
```
## License
Apache 2.0
Provider: castorini



