ismailemir/arxiv-indices
Source: Hugging Face. Updated 2026-01-03; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ismailemir/arxiv-indices
Description:
---
license: apache-2.0
tags:
- arxiv
- search
- retrieval
- bm25
- tfidf
- scibert
- information-retrieval
size_categories:
- 1M<n<10M
---
# ArXiv Search Indices
Pre-built search indices for the ArXiv paper corpus, covering three retrieval methods: BM25, TF-IDF, and SciBERT embeddings.
## 📊 Contents
| File | Description | Format |
|------|-------------|--------|
| `bm25_index.zip` | BM25 lexical search index | Compressed folder |
| `tfidf_index.npz` | TF-IDF term frequency index | NumPy compressed |
| `scibert_embeddings.npy` | SciBERT semantic embeddings | NumPy array |
## 🚀 Quick Start
```python
from huggingface_hub import hf_hub_download
import zipfile
import bm25s
import numpy as np
# Download the BM25 index archive and extract it to a local folder
bm25_zip = hf_hub_download("ismailemir/arxiv-indices", "bm25_index.zip", repo_type="dataset")
with zipfile.ZipFile(bm25_zip, 'r') as zip_ref:
    zip_ref.extractall("./bm25_index")
# Load the extracted index with bm25s
bm25_retriever = bm25s.BM25.load("./bm25_index")

# Download the TF-IDF index (a NumPy .npz archive)
tfidf_path = hf_hub_download("ismailemir/arxiv-indices", "tfidf_index.npz", repo_type="dataset")
tfidf_data = np.load(tfidf_path, allow_pickle=True)

# Download the SciBERT embeddings (a NumPy array)
scibert_path = hf_hub_download("ismailemir/arxiv-indices", "scibert_embeddings.npy", repo_type="dataset")
embeddings = np.load(scibert_path)
```
## 🔍 Retrieval Methods
### BM25 (Lexical)
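- Best for exact keyword matching: documents are ranked by term overlap with the query
- Fast to query, with no neural model needed at search time
- Well suited to technical terms and acronyms that must match literally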
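
To make the scoring behind a lexical index concrete, here is a minimal pure-Python sketch of Okapi BM25 on a toy corpus. It is illustrative only and is not how `bm25_index.zip` was built: the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, the whitespace tokenization, and the IDF variant are all assumptions chosen for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25).

    k1 and b are assumed defaults for this sketch, not values taken from the dataset.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (one of several in common use).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (hypothetical titles, whitespace-tokenized).
corpus = [
    "graph neural networks for molecule property prediction",
    "bm25 ranking for scientific document retrieval",
    "transformer language models for scientific text",
]
docs = [text.split() for text in corpus]
scores = bm25_scores("scientific retrieval".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(corpus[best])
```

Only documents that literally contain a query term receive a non-zero score, which is why this family of methods works well for exact keywords and poorly for paraphrases.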
### TF-IDF (Term Weighting)
- Weights each term by its frequency in a document and its rarity across the corpus
- Good baseline for general keyword search
- Lightweight and fast
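
As a rough illustration of TF-IDF retrieval, the sketch below builds sparse TF-IDF vectors for a toy corpus and ranks documents by cosine similarity to a query. The helper names `tfidf_vectors` and `cosine`, the smoothed IDF formula, and the whitespace tokenization are assumptions for this example; the card does not state how `tfidf_index.npz` was computed or what arrays it contains.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per tokenized document, plus the IDF table."""
    n_docs = len(docs)
    df = Counter(term for d in docs for term in set(d))
    # Smoothed IDF (an assumed variant; other formulas are equally common).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus (hypothetical titles, whitespace-tokenized).
corpus = [
    "quantum error correction with surface codes",
    "error analysis of numerical solvers",
    "deep learning for image segmentation",
]
docs = [text.split() for text in corpus]
doc_vecs, idf = tfidf_vectors(docs)

# Project the query into the same space, ignoring terms unseen in the corpus.
query = "quantum error correction".split()
query_vec = {t: tf * idf[t] for t, tf in Counter(query).items() if t in idf}

sims = [cosine(query_vec, v) for v in doc_vecs]
ranking = sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)
print([corpus[i] for i in ranking])
```

Rare terms such as "quantum" carry more weight than common ones such as "error", so the first document ranks well above the second even though both share a query term.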
### SciBERT (Semantic)
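- Dense embeddings from SciBERT, a BERT model pretrained on scientific text
- Captures context and meaning, so it can match papers that share no exact terms with the query
- Best for conceptual search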
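
Semantic search over an embedding matrix usually reduces to a nearest-neighbor lookup by cosine similarity. The sketch below shows that step with NumPy on random stand-in data. It assumes the embeddings form a 2-D array with one row per paper and uses a dimension of 768 for illustration; neither the layout of `scibert_embeddings.npy` nor the pooling used to produce it is stated in this card, and `top_k_cosine` is a hypothetical helper. A real query would first have to be encoded with the same SciBERT model and pooling as the stored vectors.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return indices and scores of the k rows most similar to query_vec."""
    # L2-normalize so that a dot product equals cosine similarity.
    doc_norm = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    q_norm = query_vec / np.linalg.norm(query_vec)
    sims = doc_norm @ q_norm
    # Sort by descending similarity and keep the first k.
    top = np.argsort(-sims)[:k]
    return top, sims[top]

# Stand-in data: random vectors in place of real SciBERT embeddings (assumption).
rng = np.random.default_rng(0)
doc_matrix = rng.normal(size=(100, 768)).astype(np.float32)
# A query that is a slightly perturbed copy of row 42.
query_vec = doc_matrix[42] + 0.01 * rng.normal(size=768).astype(np.float32)

idx, scores = top_k_cosine(query_vec, doc_matrix, k=3)
print(idx, scores)
```

For a corpus in the millions of rows, a brute-force matrix product like this can still work in batches, though an approximate nearest-neighbor index is the more common choice at that scale.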
## 📚 Related
- 📖 Corpus: [ismailemir/arxiv-corpus](https://huggingface.co/datasets/ismailemir/arxiv-corpus)
- 🔬 Source: [Cornell ArXiv Dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv)
- 🤖 Model: [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased)
## 📄 License
Apache 2.0. Please cite ArXiv if you use this data.
## 🙏 Citation
```bibtex
@article{clement2019arxiv,
title={On the Use of ArXiv as a Dataset},
author={Clement, Colin B and Bierbaum, Matthew and O'Keeffe, Kevin P and Alemi, Alexander A},
journal={arXiv preprint arXiv:1905.00075},
year={2019}
}
```
Provider:
ismailemir



