
castorini/NanoKnow-Fineweb-Edu-Index

Source: Hugging Face | Updated: 2026-04-18 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/castorini/NanoKnow-Fineweb-Edu-Index
Description:
---
license: apache-2.0
size_categories:
- 10M<n<100M
task_categories:
- text-retrieval
tags:
- lucene
- bm25
- fineweb
- nanochat
- information-retrieval
---

# NanoKnow FineWeb-Edu Lucene Index

[[Paper](https://huggingface.co/papers/2602.20122)] [[Code](https://github.com/castorini/NanoKnow)]

A pre-built [Lucene](https://lucene.apache.org/) BM25 index over [karpathy/fineweb-edu-100b-shuffle](https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle), the exact pre-training corpus used by the [nanochat](https://github.com/karpathy/nanochat) family of language models. Built with [Anserini](https://github.com/castorini/anserini).

This index is part of the **NanoKnow** project: [github.com/castorini/NanoKnow](https://github.com/castorini/NanoKnow)

## Index Details

| Property | Value |
|----------|-------|
| **Corpus** | [karpathy/fineweb-edu-100b-shuffle](https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle) |
| **Documents** | 97,230,848 |
| **Index Size** | ~325 GB (extracted) |
| **Index Type** | Lucene (BM25) |
| **Built With** | Anserini / Pyserini |
| **Distribution** | 6 × `tar.part.*` files (~324 GB total), 680 Lucene segment files when extracted |

## Document ID Format

Each document has a unique ID: `shard_XXXXX_YYYYY`

- `XXXXX`: zero-padded shard number (0-1822)
- `YYYYY`: row offset within the parquet shard

For example, `shard_00151_20323` refers to row 20,323 in shard 151 of the FineWeb-Edu parquet files.

## Usage

### Download

The index is distributed as 6 split tar parts. Download all 6 parts and reassemble:

```bash
# Download all 6 parts (each ~64 GB; part.05 is ~4.4 GB)
for i in 00 01 02 03 04 05; do
  wget https://huggingface.co/datasets/castorini/NanoKnow-Fineweb-Edu-Index/resolve/main/lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.$i
done

# (Optional) Verify checksums
md5sum -c <<'EOF'
309e75651d954a4d81edc6bc5b8f1d38  lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.00
313260d601b88ec443d2e7db94df08df  lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.01
a2b446e7a40d89b1975c95f1abbd8683  lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.02
1e647f11aa01016a53f6c0847ce7ae86  lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.03
47a49ee4b2c7344b625e999c9658f817  lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.04
65ec80b055978356e5bd1772bdf18151  lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.05
EOF

# Reassemble + extract (streaming; never materializes the 325 GB tar on disk)
cat lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.* | tar -xf -

# This creates the directory:
#   lucene-inverted.fineweb-edu-100b-karpathy.20260416/
```

Alternatively, you can use the Hugging Face CLI to fetch all 6 parts in one shot:

```bash
hf download castorini/NanoKnow-Fineweb-Edu-Index --repo-type dataset --local-dir ./fineweb-edu-index
cd ./fineweb-edu-index
cat lucene-inverted.fineweb-edu-100b-karpathy.20260416.tar.part.* | tar -xf -
```

### Search with Pyserini

```python
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("./lucene-inverted.fineweb-edu-100b-karpathy.20260416")
print(f"Index contains {searcher.num_docs:,} documents")

hits = searcher.search("What is the capital of France?", k=10)
for hit in hits:
    print(f"{hit.docid}: {hit.score:.4f}")
```

### Retrieve Document Text

```python
import json

doc = searcher.doc("shard_00151_20323")
text = json.loads(doc.raw())["contents"]
print(text[:500])
```

## Reproducing BM25 Effectiveness

This index reproduces the published Anserini regression for NanoKnow v1 (NQ-Open validation): **R@20 = 0.3283** with default BM25 (`k1=0.9, b=0.4`). See the [Anserini documentation](https://github.com/castorini/anserini/blob/master/docs/reproduce/from-document-collection/nanoknow-v1-nq.md) for the full reproduction recipe.

## Related Resources

- **Benchmark Qrels**: [LingweiGu/NanoKnow_Benchmark](https://huggingface.co/datasets/LingweiGu/NanoKnow_Benchmark). Pre-built relevance judgments that partition SQuAD and NQ questions into supported/unsupported splits based on this corpus.
- **Code**: [github.com/castorini/NanoKnow](https://github.com/castorini/NanoKnow). Scripts to project new benchmarks onto this index, evaluate nanochat checkpoints, and analyze frequency effects.

## Citation

```bibtex
@article{gu2026nanoknow,
  title={NanoKnow: How to Know What Your Language Model Knows},
  author={Gu, Lingwei and Jedidi, Nour and Lin, Jimmy},
  journal={arXiv preprint arXiv:2602.20122},
  year={2026}
}
```

## License

Apache 2.0
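
## Example: Parsing Document IDs

The `shard_XXXXX_YYYYY` ID format described under "Document ID Format" can be split back into a shard number and a row offset, for instance to look up a search hit in the original parquet files. The following is a minimal sketch, not part of the NanoKnow code base: the names `parse_docid`, `DocLocation`, and `MAX_SHARD` are hypothetical, and the upper bound of 1822 is taken from the shard range stated above.

```python
import re
from typing import NamedTuple

# Hypothetical helper names; only the ID layout itself comes from the card.
_DOCID_RE = re.compile(r"shard_(\d+)_(\d+)")

# Highest shard number stated in the card (shards are numbered 0-1822).
MAX_SHARD = 1822


class DocLocation(NamedTuple):
    shard: int  # parquet shard number
    row: int    # row offset within that shard


def parse_docid(docid: str) -> DocLocation:
    """Split an ID such as 'shard_00151_20323' into (shard, row)."""
    match = _DOCID_RE.fullmatch(docid)
    if match is None:
        raise ValueError(f"not a shard_XXXXX_YYYYY document ID: {docid!r}")
    shard, row = int(match.group(1)), int(match.group(2))
    if shard > MAX_SHARD:
        raise ValueError(f"shard {shard} is outside the range 0-{MAX_SHARD}")
    return DocLocation(shard, row)


print(parse_docid("shard_00151_20323"))
# prints: DocLocation(shard=151, row=20323)
```

The pattern accepts any number of digits in either field, so it does not depend on whether the row offset is zero-padded, which the card does not state.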
Provider:
castorini