Comet/wikipedia-2017-bm25

Name: Comet/wikipedia-2017-bm25
Creator: Comet
Published: 2025-11-24 08:52:54
License: CC-BY-SA 3.0 (per the dataset card)

Source: Hugging Face | Updated: 2025-11-24 | Indexed: 2025-12-20

Download link:

https://hf-mirror.com/datasets/Comet/wikipedia-2017-bm25

Description:

---
license: cc-by-sa-3.0
task_categories:
- text-retrieval
language:
- en
tags:
- bm25
- wikipedia
- information-retrieval
- research
size_categories:
- 1M<n<10M
---

# Wikipedia 2017 BM25 Search Index

This dataset provides a production-ready BM25 search index over **5.2 million Wikipedia article abstracts** from the 2017 snapshot. Built using the `bm25s` library with English stemming and optimized Parquet compression, it enables fast, offline information retrieval for research and production AI systems. The corpus is identical to the one used in influential AI research papers including DSPy and GEPA, which supports reproducible benchmarking and fair comparison across studies.

The index uses BM25 (Best Matching 25), a probabilistic ranking function widely used as a standard baseline for lexical search. The parameters are set to k1=0.9 (term frequency saturation) and b=0.4 (document length normalization), which gives strong lexical retrieval performance for factual queries. Each search completes in under 100ms on consumer hardware, making the index suitable for real-time applications, RAG (Retrieval-Augmented Generation) pipelines, and agent tool implementations.

We created this index for the [Opik Optimizer](https://github.com/comet-ml/opik) project to enable reproducible prompt optimization experiments and agent benchmarking. By using the same Wikipedia 2017 corpus as established research, we ensure that optimization results are directly comparable to published baselines. The Parquet-compressed format reduces download size by 67% while maintaining full search fidelity, making it practical for cloud deployments and CI/CD pipelines where storage costs and download times matter.
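
The roles of k1 and b described above can be illustrated with a small pure-Python sketch. It uses a common textbook form of BM25 with a non-negative IDF; this is an assumption for illustration, and the exact variant used by `bm25s` when this index was built may differ. The function and argument names are hypothetical.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=0.9, b=0.4):
    """Contribution of one query term to one document's BM25 score."""
    # Inverse document frequency: rarer terms weigh more. The 1.0 inside the
    # log keeps the value non-negative even for very common terms.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b=0 ignores document length, b=1 normalizes fully.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Term-frequency saturation: the ratio approaches (k1 + 1) as tf grows.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


def bm25_score(query_terms, doc_terms, avg_doc_len, n_docs, doc_freqs, k1=0.9, b=0.4):
    """Sum the per-term contributions for query terms present in the document."""
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        score += bm25_term_score(
            tf, len(doc_terms), avg_doc_len, n_docs, doc_freqs[term], k1, b
        )
    return score
```

With a low k1 such as 0.9, repeated occurrences of a term add little beyond the first few, and with b=0.4 longer documents are penalized only mildly. Both choices are plausible for short, fairly uniform abstracts.
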

**Size**: 1.61 GB | **Format**: Parquet (67% compressed) | **Documents**: 5.2M

## Format: Optimized (Parquet)

This is the **optimized version** with 40-50% size reduction:

- **Corpus Format**: Chunked Parquet with ZSTD compression (level 9)
- **Index Format**: NumPy compressed arrays (.npz)
- **Size**: ~2.5-3.5 GB (vs ~4.87 GB standard)

### Benefits:

- **Smaller downloads**: 40-50% reduction vs standard format
- **Streaming access**: Load only needed document chunks
- **Faster HF downloads**: Parallel chunk downloads
- **Better compression**: ZSTD level 9 on columnar format

### When to use:

- Running on Modal or cloud workers (storage costs matter)
- Bandwidth-constrained environments
- High-volume deployments

---

## Quick Start

Install the package:

```bash
# Quotes keep shells such as zsh from expanding the square brackets
pip install "opik_optimizer[bm25]"
```

Run the search:

```python
from opik_optimizer.utils.tools.wikipedia import search_wikipedia

# Return the top 5 BM25 matches from the Hugging Face-hosted index
results = search_wikipedia(
    "quantum entanglement",
    search_type="bm25",
    n=5,
    bm25_hf_repo="Comet/wikipedia-2017-bm25",
)
```

**That's it!** The first run downloads the index (~1.6 GB); subsequent searches are near-instant.

---

## Why Use This?

- **Reproducible Research** - Same corpus used in various prompt and agent optimization papers
- **Fast & Offline** - No API rate limits, <100ms query time
- **Production Ready** - Powers RAG systems, Q&A benchmarks, agents
- **Memory Efficient** - Optimized Parquet format with chunked loading

## Use Cases

**Research & Benchmarking**

- HotpotQA multi-hop question answering
- Information retrieval experiments
- RAG pipeline evaluation
- Agent tool development

**Production Applications**

- Offline knowledge base for AI agents
- Research paper search
- Educational tools
- Content recommendation

---

## Index Specifications

| Attribute | Value |
|-----------|-------|
| Documents | 5,233,330 Wikipedia abstracts |
| Source | Wikipedia 2017 dump (DSPy cache) |
| Tokenization | English stemming + stopword removal |
| Algorithm | BM25 (k1=0.9, b=0.4) |
| Library | [`bm25s`](https://github.com/xhluca/bm25s) |
| Memory | ~6-8 GB RAM during search |
| Query Speed | <100ms per search |

---

## License & Attribution

**Dataset License**: [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) (Wikipedia content license)

**Source**: Wikipedia 2017 abstracts from [DSPy cache](https://huggingface.co/dspy/cache)

**Citation**: If you use this in research, please cite:

```
@misc{comet_ml_wikibm25,
  author    = { Comet ML and Vincent Koc },
  title     = { wikipedia-2017-bm25 (Revision 07db6d6) },
  year      = 2025,
  url       = { https://huggingface.co/datasets/Comet/wikipedia-2017-bm25 },
  doi       = { 10.57967/hf/7073 },
  publisher = { Hugging Face }
}
```

---

## Related Links and Datasets

- [wikipedia dataset](https://huggingface.co/datasets/wikipedia): Full Wikipedia dumps (all languages)
- [opik_optimizer](https://github.com/comet-ml/opik/tree/main/sdks/opik_optimizer) repository

**Built with [opik_optimizer](https://github.com/comet-ml/opik) by Comet** - Thanks to [@vincentkoc](https://github.com/vincentkoc) for creating this Parquet version.
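
The query-speed figure quoted in the card (<100ms per search) depends on hardware, so it may be worth checking locally. The following is a minimal sketch, assuming only that the search is exposed as a callable taking a query string. The helper names `measure_latency_ms` and `summarize` are hypothetical and not part of `opik_optimizer`.

```python
import statistics
import time


def measure_latency_ms(search_fn, queries, warmup=1):
    """Return per-query wall-clock latencies in milliseconds."""
    # Warm-up calls absorb one-time costs such as index download and loading.
    for query in queries[:warmup]:
        search_fn(query)
    latencies = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies):
    """Median and worst-case latency, to compare against the 100 ms figure."""
    return {"median_ms": statistics.median(latencies), "max_ms": max(latencies)}


if __name__ == "__main__":
    # Stand-in search function so the sketch runs without the index installed.
    def fake_search(query):
        return [query.upper()]

    print(summarize(measure_latency_ms(fake_search, ["alpha", "beta", "gamma"])))
```

To time the real index, replace `fake_search` with a wrapper around the Quick Start call, for example `lambda q: search_wikipedia(q, search_type="bm25", n=5, bm25_hf_repo="Comet/wikipedia-2017-bm25")`.
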

Provider:

Comet