castorini/NanoKnow-Fineweb-Edu-Index
Hugging Face · Updated 2026-04-18 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/castorini/NanoKnow-Fineweb-Edu-Index
Resource description:
---
license: apache-2.0
size_categories:
- 10M<n<100M
task_categories:
- text-retrieval
tags:
- lucene
- bm25
- fineweb
- nanochat
- information-retrieval
---
# NanoKnow FineWeb-Edu Lucene Index
[[Paper](https://huggingface.co/papers/2602.20122)] [[Code](https://github.com/castorini/NanoKnow)]
A pre-built [Lucene](https://lucene.apache.org/) BM25 index over [karpathy/fineweb-edu-100b-shuffle](https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle)—the exact pre-training corpus used by the [nanochat](https://github.com/karpathy/nanochat) family of language models. Built with [Anserini](https://github.com/castorini/anserini).
This index is part of the **NanoKnow** project: [github.com/castorini/NanoKnow](https://github.com/castorini/NanoKnow)
## Index Details
| Property | Value |
|----------|-------|
| **Corpus** | [karpathy/fineweb-edu-100b-shuffle](https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle) |
| **Documents** | 97,230,848 |
| **Index Size** | ~325 GB (extracted) |
| **Index Type** | Lucene (BM25) |
| **Built With** | Anserini / Pyserini |
| **Distribution** | 6 × `tar.part.*` files (~324 GB total), 680 Lucene segment files when extracted |
## Document ID Format
Each document has a unique ID: `shard_XXXXX_YYYYY`
- `XXXXX`: zero-padded shard number (0 to 1822, written as `00000` through `01822`)
- `YYYYY`: row offset within the parquet shard
For example, `shard_00151_20323` refers to row 20,323 in shard 151 of the FineWeb-Edu parquet files.
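The ID scheme above can be decoded with a few lines of Python. This is a minimal sketch: `parse_doc_id` is a hypothetical helper (it is not part of Pyserini or the NanoKnow code), and it assumes every ID matches the `shard_XXXXX_YYYYY` pattern described here.

```python
import re

# Matches IDs such as "shard_00151_20323": shard number, then row offset.
_DOC_ID_RE = re.compile(r"shard_(\d+)_(\d+)")


def parse_doc_id(doc_id: str) -> tuple[int, int]:
    """Split a document ID into (shard_number, row_offset).

    Hypothetical helper for illustration; raises ValueError on IDs that
    do not follow the shard_XXXXX_YYYYY pattern.
    """
    match = _DOC_ID_RE.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"Unrecognized document ID: {doc_id!r}")
    return int(match.group(1)), int(match.group(2))


shard, row = parse_doc_id("shard_00151_20323")
print(shard, row)
```

The returned pair can then be used to locate the source row in the corresponding FineWeb-Edu parquet shard.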
## Usage
### Download
The index is distributed as 6 split tar parts. Download all 6 parts and reassemble:
```bash
# Download all 6 parts (parts 00-04 are ~64 GB each; part.05 is ~4.4 GB)
for i in 00 01 02 03 04 05; do
  wget https://huggingface.co/datasets/castorini/NanoKnow-Fineweb-Edu-Index/resolve/main/lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.$i
done
# (Optional) Verify checksums
md5sum -c <<'EOF'
309e75651d954a4d81edc6bc5b8f1d38 lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.00
313260d601b88ec443d2e7db94df08df lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.01
a2b446e7a40d89b1975c95f1abbd8683 lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.02
1e647f11aa01016a53f6c0847ce7ae86 lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.03
47a49ee4b2c7344b625e999c9658f817 lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.04
65ec80b055978356e5bd1772bdf18151 lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.05
EOF
# Reassemble + extract (streaming; never materializes the 325 GB tar on disk)
cat lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.* | tar -xf -
# This creates the directory:
# lucene-inverted.fineweb-edu-100b-karpathy.20260416/
```
Alternatively, you can use the Hugging Face CLI to fetch all 6 parts in one shot:
```bash
hf download castorini/NanoKnow-Fineweb-Edu-Index --repo-type dataset --local-dir ./fineweb-edu-index
cd ./fineweb-edu-index
cat lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.* | tar -xf -
```
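On systems without `md5sum`, the same check can be done from Python. The sketch below is one possible approach, not part of the official tooling: `md5_of_file` and `verify_parts` are hypothetical helper names, and the expected checksums are assumed to be passed in as a dict copied from the list in the download instructions. Files are hashed in chunks so the ~64 GB parts are never loaded into memory at once.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_parts(expected, directory="."):
    """Return the file names that are missing or whose MD5 does not match.

    `expected` maps file name -> expected MD5 hex digest (hypothetical
    input format chosen for this sketch).
    """
    failed = []
    for name, md5 in expected.items():
        path = Path(directory) / name
        if not path.is_file() or md5_of_file(path) != md5.lower():
            failed.append(name)
    return failed
```

An empty list from `verify_parts` means every listed part is present and matches its checksum.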
### Search with Pyserini
```python
from pyserini.search.lucene import LuceneSearcher
searcher = LuceneSearcher("./lucene-inverted.fineweb-edu-100b-karpathy.20260416")
print(f"Index contains {searcher.num_docs:,} documents")
hits = searcher.search("What is the capital of France?", k=10)
for hit in hits:
    print(f"{hit.docid}: {hit.score:.4f}")
```
### Retrieve Document Text
```python
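import json
# Reuses the `searcher` object created in the search example above.
doc = searcher.doc("shard_00151_20323")
# Each stored document is a JSON string; the text lives under "contents".
text = json.loads(doc.raw())["contents"]
print(text[:500])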
```
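Searching and text retrieval are often done together. The following sketch combines the two steps using only the calls shown in this section (`search`, `doc`, `raw`, and the `docid` and `score` attributes of a hit); `search_with_text` is a hypothetical convenience function, not part of Pyserini.

```python
import json


def search_with_text(searcher, query, k=10, max_chars=500):
    """Run a query and attach a text preview to each hit.

    `searcher` is expected to behave like the LuceneSearcher used above.
    Returns a list of dicts with keys "docid", "score", and "text".
    """
    results = []
    for hit in searcher.search(query, k=k):
        # Stored documents are JSON strings with the text under "contents".
        raw = searcher.doc(hit.docid).raw()
        text = json.loads(raw)["contents"]
        results.append(
            {"docid": hit.docid, "score": hit.score, "text": text[:max_chars]}
        )
    return results
```

With the index loaded, `search_with_text(searcher, "What is the capital of France?", k=5)` would return up to five entries, each with a 500-character preview.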
## Reproducing BM25 Effectiveness
This index reproduces the published Anserini regression for NanoKnow v1 (NQ-Open
validation): **R@20 = 0.3283** with default BM25 (`k1=0.9, b=0.4`). See the
[Anserini documentation](https://github.com/castorini/anserini/blob/master/docs/reproduce/from-document-collection/nanoknow-v1-nq.md)
for the full reproduction recipe.
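To make the roles of `k1` and `b` concrete, here is a small sketch of a per-term BM25 score in the commonly cited Lucene-style form. It only illustrates how the two parameters shape the score and is not the exact scoring code Anserini runs; the function name and argument names are assumptions made for this example.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=0.9, b=0.4):
    """Score one query term against one document (illustrative sketch).

    k1 controls how quickly repeated occurrences of a term saturate;
    b controls how strongly the score is normalized by document length.
    """
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: documents longer than average are penalized when b > 0.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf / (tf + norm)


# Same term frequency, but the second document is four times longer than average.
print(bm25_term_score(tf=3, doc_len=100, avg_doc_len=100, num_docs=1000, doc_freq=10))
print(bm25_term_score(tf=3, doc_len=400, avg_doc_len=100, num_docs=1000, doc_freq=10))
```

The second call yields a lower score than the first because of the length penalty; setting `b=0` would make the two scores identical.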
## Related Resources
- **Benchmark Qrels**: [LingweiGu/NanoKnow_Benchmark](https://huggingface.co/datasets/LingweiGu/NanoKnow_Benchmark) — Pre-built relevance judgments that partition SQuAD and NQ questions into supported/unsupported splits based on this corpus.
- **Code**: [github.com/castorini/NanoKnow](https://github.com/castorini/NanoKnow) — Scripts to project new benchmarks onto this index, evaluate nanochat checkpoints, and analyze frequency effects.
## Citation
```bibtex
@article{gu2026nanoknow,
title={NanoKnow: How to Know What Your Language Model Knows},
author={Gu, Lingwei and Jedidi, Nour and Lin, Jimmy},
journal={arXiv preprint arXiv:2602.20122},
year={2026}
}
```
## License
Apache 2.0
Provider:
castorini



