spark-scholar-arxiv-snapshots
Source: Hugging Face. Updated: 2026-03-14. Indexed: 2026-03-16.
Download link:
https://huggingface.co/datasets/MARKYMARK55/spark-scholar-arxiv-snapshots
Description:
The Spark-Scholar arXiv Snapshot Dataset is an open-source dataset built for efficient hybrid search. It contains BGE-M3 dense (1024-dimensional) and sparse (SPLADE) vectors for 3.08 million arXiv papers. The dataset is split into 15 domain-specific collections, each holding complete paper metadata (title, abstract, authors, categories, date, etc.) together with pre-computed vector embeddings. The snapshots can be restored directly into a Qdrant vector database, where they support native RRF (reciprocal rank fusion) hybrid search, and they are intended as the data layer for the self-hosted Spark-Scholar research platform. Partitioning by domain avoids the high latency of a single large collection and, according to the dataset description, achieves sub-10 ms search. The dataset also documents its metadata structure and vector parameters in detail, and is suited to tasks such as information retrieval and feature extraction.
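As a rough illustration of the RRF hybrid search mentioned above, the sketch below fuses a dense ranking and a sparse ranking in plain Python. It is not Qdrant's implementation: the function name `rrf_fuse`, the constant `k=60` (a commonly used default), the 1-based ranks, and the paper IDs are all assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document ids with reciprocal rank fusion.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank is 1-based. k=60 is an assumed, conventional constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so output is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists, e.g. one from the dense vectors and one
# from the sparse vectors of the same domain collection.
dense_hits = ["2401.00001", "2401.00002", "2401.00003"]
sparse_hits = ["2401.00002", "2401.00004", "2401.00001"]

fused = rrf_fuse([dense_hits, sparse_hits])
for doc_id, score in fused:
    print(f"{doc_id}  {score:.5f}")
```

Papers that rank well in both lists rise to the top, while papers found by only one retriever still appear lower in the fused list. Because RRF uses ranks only, the dense and sparse scores do not need to be on a comparable scale.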
Created:
2026-03-13



