spark-scholar-arxiv-snapshots

Hugging Face | Updated 2026-03-14 | Indexed 2026-03-16

Download link:

https://huggingface.co/datasets/MARKYMARK55/spark-scholar-arxiv-snapshots

Description:

The Spark-Scholar arXiv Snapshot Dataset is an open-source dataset designed for efficient hybrid search. It contains BGE-M3 dense (1024-dimensional) and sparse (SPLADE) vectors for 3.08 million arXiv papers. The dataset is split into 15 domain-specific collections, each holding the complete paper metadata (title, abstract, authors, categories, date, etc.) together with the precomputed vector embeddings. The snapshots can be restored directly into a Qdrant vector database with native RRF hybrid search, and are intended to serve as the data layer for the self-hosted Spark-Scholar research platform. By partitioning the papers by domain, the dataset avoids the high latency of a single large collection and achieves sub-10 ms search times. It also documents the metadata schema and vector parameters in detail, and is suitable for tasks such as information retrieval and feature extraction.
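The RRF (Reciprocal Rank Fusion) step behind hybrid search can be illustrated with a minimal sketch. It merges one ranked list from the dense vectors and one from the sparse vectors by summing 1 / (k + rank) for each paper. This is not Qdrant's implementation, which runs server-side. The function name `rrf_fuse`, the constant k = 60, the 1-based ranks, and the paper IDs are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, limit=10):
    """Fuse several ranked ID lists with Reciprocal Rank Fusion.

    rankings: list of lists of document IDs, best hit first.
    k: smoothing constant (60 is a common choice, assumed here).
    Returns up to `limit` (doc_id, score) pairs, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based in this sketch: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:limit]


# Hypothetical result lists for a single query (IDs are made up).
dense_hits = ["paper_a", "paper_b", "paper_c"]   # from the dense vectors
sparse_hits = ["paper_b", "paper_d", "paper_a"]  # from the sparse vectors

fused = rrf_fuse([dense_hits, sparse_hits])
for doc_id, score in fused:
    print(f"{doc_id}\t{score:.5f}")
```

Papers that appear near the top of both lists accumulate the highest fused score, so the fusion needs only ranks and never has to compare raw dense and sparse scores directly.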

Created:

2026-03-13
