ismailemir/arxiv-corpus
Hugging Face | Updated 2026-01-03 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ismailemir/arxiv-corpus
Resource description:
---
license: apache-2.0
tags:
- arxiv
- papers
- abstracts
- nlp
- scientific-papers
size_categories:
- 1M<n<10M
---
# ArXiv Paper Abstracts
A collection of ArXiv paper abstracts in JSON Lines format for search and retrieval tasks.
## 📊 Dataset Info
- **Papers**: 2+ million
- **Format**: JSON Lines (one object per line)
- **Source**: [Kaggle ArXiv Dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv)
- **Updated**: Version 266
## 📝 Format
Each line contains a JSON object:
```json
{"abstract": "Paper abstract text..."}
```
## 🚀 Quick Start
```python
import json

from huggingface_hub import hf_hub_download

# Download the corpus file from the Hub (cached locally after the first call)
corpus_path = hf_hub_download(
    repo_id="ismailemir/arxiv-corpus",
    filename="arxiv_abstracts.json",
    repo_type="dataset",
)

# Load the corpus: the file holds one JSON object per line
corpus = []
with open(corpus_path, "r", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        corpus.append(row["abstract"])

print(f"Loaded {len(corpus):,} abstracts")
```
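With 2+ million abstracts, holding the whole corpus in a list may be more memory than some workflows need. The sketch below shows one way to stream records lazily instead. The helper name `iter_abstracts` and the sample lines are illustrative assumptions, not part of the dataset or its tooling, and skipping blank lines or empty abstracts is a defensive choice rather than a documented property of the file.

```python
import json
from itertools import islice


def iter_abstracts(lines):
    """Yield the "abstract" field of each JSON Lines record, one at a time.

    `lines` can be any iterable of strings, including an open file handle.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        abstract = record.get("abstract")
        if abstract:  # skip records with a missing or empty abstract
            yield abstract


# Tiny in-memory stand-in for the real file (hypothetical sample data).
sample_lines = [
    '{"abstract": "We study sparse retrieval."}',
    "",
    '{"abstract": "A survey of dense retrievers."}',
]

# Take only the first two abstracts without materializing the rest.
first_two = list(islice(iter_abstracts(sample_lines), 2))
print(first_two)
```

To use this on the real corpus, pass an open file handle for the downloaded file, for example `iter_abstracts(f)` inside `with open(corpus_path, encoding="utf-8") as f:`.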
## 💡 Use Cases
- 📚 Academic paper search
- 🔍 Information retrieval research
- 🤖 Training search models
- 📊 Text mining and analysis
- 🧪 Benchmarking retrieval systems
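
As a rough illustration of the search use case, here is a minimal keyword-search sketch over a list of abstracts using a smoothed TF-IDF score in plain Python. It is not the method behind the related search indices, which this card does not describe. The function names (`tokenize`, `build_index`, `search`), the scoring formula, and the three sample abstracts are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Return per-document term counts and smoothed IDF weights."""
    doc_terms = [Counter(tokenize(d)) for d in docs]
    df = Counter()
    for terms in doc_terms:
        df.update(terms.keys())  # count each term once per document
    n = len(docs)
    idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}
    return doc_terms, idf


def search(query, docs, doc_terms, idf, k=3):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    q_terms = tokenize(query)
    scored = []
    for i, terms in enumerate(doc_terms):
        score = sum(terms[t] * idf.get(t, 0.0) for t in q_terms)
        if score > 0:
            scored.append((score, i))
    # Highest score first; ties fall back to original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(docs[i], score) for score, i in scored[:k]]


# Hypothetical sample abstracts standing in for the loaded corpus.
docs = [
    "Graph neural networks for molecule property prediction.",
    "Dense passage retrieval for open-domain question answering.",
    "Sparse retrieval with learned term weights.",
]
doc_terms, idf = build_index(docs)
for text, score in search("retrieval", docs, doc_terms, idf):
    print(f"{score:.3f}  {text}")
```

A linear scan like this is only practical for small subsets. At the scale of the full corpus, an inverted index or a dedicated retrieval library would be the more realistic choice.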
## 🔗 Related
- 🔍 Search Indices: [ismailemir/arxiv-indices](https://huggingface.co/datasets/ismailemir/arxiv-indices)
- 🔬 Original Dataset: [Cornell ArXiv](https://www.kaggle.com/datasets/Cornell-University/arxiv)
## 📄 License
Apache 2.0 - Please cite ArXiv if using this data.
## 🙏 Citation
```bibtex
@article{clement2019arxiv,
title={On the Use of ArXiv as a Dataset},
author={Clement, Colin B and Bierbaum, Matthew and O'Keeffe, Kevin P and Alemi, Alexander A},
journal={arXiv preprint arXiv:1905.00075},
year={2019}
}
```
## 📧 Contact
For issues or questions, please open an issue on the Hugging Face dataset page.
Provided by:
ismailemir



