ismailemir/arxiv-indices

Source: Hugging Face. Updated 2026-01-03; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ismailemir/arxiv-indices
Description:
---
license: apache-2.0
tags:
- arxiv
- search
- retrieval
- bm25
- tfidf
- scibert
- information-retrieval
size_categories:
- 1M<n<10M
---

# ArXiv Search Indices

Pre-built search indices for the ArXiv paper corpus, supporting multiple retrieval methods.

## 📊 Contents

| File | Description | Format |
|------|-------------|--------|
| `bm25_index.zip` | BM25 lexical search index | Compressed folder |
| `tfidf_index.npz` | TF-IDF term frequency index | NumPy compressed |
| `scibert_embeddings.npy` | SciBERT semantic embeddings | NumPy array |

## 🚀 Quick Start

```python
from huggingface_hub import hf_hub_download
import zipfile
import bm25s
import numpy as np

# Download and extract BM25
bm25_zip = hf_hub_download("ismailemir/arxiv-indices", "bm25_index.zip", repo_type="dataset")
with zipfile.ZipFile(bm25_zip, 'r') as zip_ref:
    zip_ref.extractall("./bm25_index")
bm25_retriever = bm25s.BM25.load("./bm25_index")

# Download TF-IDF
tfidf_path = hf_hub_download("ismailemir/arxiv-indices", "tfidf_index.npz", repo_type="dataset")
tfidf_data = np.load(tfidf_path, allow_pickle=True)

# Download SciBERT embeddings
scibert_path = hf_hub_download("ismailemir/arxiv-indices", "scibert_embeddings.npy", repo_type="dataset")
embeddings = np.load(scibert_path)
```

## 🔍 Retrieval Methods

### BM25 (Lexical)

- Best for exact keyword matching
- Fast and efficient
- Great for technical terms

### TF-IDF (Term Frequency)

- Statistical word importance
- Good for general search
- Lightweight and fast

### SciBERT (Semantic)

- Deep learning embeddings
- Understands context and meaning
- Best for conceptual search

## 📚 Related

- 📖 Corpus: [ismailemir/arxiv-corpus](https://huggingface.co/datasets/ismailemir/arxiv-corpus)
- 🔬 Source: [Cornell ArXiv Dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv)
- 🤖 Model: [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased)

## 📄 License

Apache 2.0. Please cite ArXiv if using this data.

## 🙏 Citation

```bibtex
@article{clement2019arxiv,
  title={On the Use of ArXiv as a Dataset},
  author={Clement, Colin B and Bierbaum, Matthew and O'Keeffe, Kevin P and Alemi, Alexander A},
  journal={arXiv preprint arXiv:1905.00075},
  year={2019}
}
```
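
The Quick Start loads `scibert_embeddings.npy` as a plain NumPy array but does not show how to search it. Below is a minimal sketch of a top-k cosine-similarity lookup over such a matrix. The helper name `top_k_cosine` and the toy data are hypothetical, and the sketch assumes that each row holds one paper's embedding and that the query vector comes from the same SciBERT model with the same pooling. The README confirms neither assumption.

```python
import numpy as np


def top_k_cosine(query_vec: np.ndarray, embeddings: np.ndarray, k: int = 5):
    """Return (indices, scores) of the k rows most similar to query_vec (k >= 1)."""
    # Normalize rows and query so that a dot product equals cosine similarity.
    row_norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    emb_unit = embeddings / np.clip(row_norms, 1e-12, None)
    query_unit = query_vec / max(np.linalg.norm(query_vec), 1e-12)

    scores = emb_unit @ query_unit
    k = min(k, scores.shape[0])

    # argpartition selects the k best rows without a full sort;
    # only those k rows are then ordered by descending score.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return top, scores[top]


# Toy stand-in for the real matrix: 4 hypothetical papers, 3 dimensions.
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([1.0, 0.0, 0.0])

indices, scores = top_k_cosine(toy_query, toy_embeddings, k=2)
print(indices, scores)
```

With the real data, `embeddings` from the Quick Start would replace `toy_embeddings`. Mapping the returned row indices back to papers requires the corpus ordering, which is presumably that of the linked `ismailemir/arxiv-corpus` dataset but should be verified.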
Provider:
ismailemir