arXiv-abstract-model2vec
ModelScope community. Updated 2025-12-05; indexed 2025-07-12.
Download link:
https://modelscope.cn/datasets/sleeping-ai/arXiv-abstract-model2vec
Resource description:
# arXiv-model2vec Dataset
The **arXiv-model2vec** dataset contains embeddings for all arXiv paper abstracts and their corresponding titles, generated using the **Model2Vec proton-8M** model. This dataset is released to support research projects that require high-quality semantic representations of scientific papers.
## Dataset Details
- **Source**: The dataset includes embeddings derived from arXiv paper titles and abstracts.
- **Embedding Model**: The embeddings are generated using the **Model2Vec proton-8M** model, which provides dense vector representations of text.
- **Content**:
  - **Title Embeddings**: Vector representations of the titles of arXiv papers.
  - **Abstract Embeddings**: Vector representations of the abstracts of arXiv papers.
## Intended Use Cases
This dataset can be used for a variety of research applications, including:
- **Semantic Search**: Improve search systems by leveraging the semantic information captured in the embeddings.
- **Clustering and Classification**: Group or categorize papers based on their content similarity.
- **Recommendation Systems**: Build systems that recommend relevant papers to users based on their interests.
- **Trend Analysis**: Analyze trends in scientific research by examining changes in the embeddings over time.
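As an illustration of the semantic search use case, the following is a minimal sketch of nearest-neighbor lookup by cosine similarity. It does not load the dataset itself: the toy 3-dimensional vectors stand in for real abstract embeddings, and the helper names `cosine_similarity` and `top_k` are hypothetical, not part of any dataset tooling. The actual file layout and embedding dimensionality should be checked against the download.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def top_k(query, embeddings, k=3):
    """Return (index, score) pairs for the k embeddings most similar to query."""
    scored = [(i, cosine_similarity(query, e)) for i, e in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Assumption: toy vectors standing in for real abstract embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]

results = top_k(query, toy_embeddings, k=2)
for index, score in results:
    print(index, round(score, 4))
```

For a collection the size of arXiv, a pure-Python loop like this would be too slow; a vectorized array library or an approximate nearest-neighbor index is the more likely choice in practice. The same similarity scores can also feed clustering or recommendation pipelines.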
## Dataset Release
The **arXiv-model2vec** dataset is made available for academic and research purposes. We hope this resource will aid researchers in exploring new insights and advancing the field of scientific literature analysis.
For any questions or further information, feel free to reach out.
Provider: maas
Created: 2025-07-07



