cometadata/arxiv-software-repo-links
Hugging Face · Updated 2026-01-08 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/cometadata/arxiv-software-repo-links
---
license: cc0-1.0
task_categories:
- text-classification
language:
- en
tags:
- arxiv
- software
- github
- code
- research
- citations
- co-citation
- clustering
size_categories:
- 100K<n<1M
configs:
- config_name: links
  data_files:
  - split: train
    path: data/links/train.jsonl
- config_name: repo_stats
  data_files:
  - split: train
    path: data/repo_stats/train.jsonl
- config_name: user_stats
  data_files:
  - split: train
    path: data/user_stats/train.jsonl
- config_name: repo_cocitations
  data_files:
  - split: train
    path: data/repo_cocitations/train.jsonl
- config_name: user_cocitations
  data_files:
  - split: train
    path: data/user_cocitations/train.jsonl
- config_name: repo_clusters
  data_files:
  - split: train
    path: data/repo_clusters/train.jsonl
- config_name: user_clusters
  data_files:
  - split: train
    path: data/user_clusters/train.jsonl
---
# arXiv Software Repository Links
A dataset mapping arXiv papers (via DOI) to software repositories they reference or are supplemented by, with co-citation analysis and community clustering.
## Quick Start
```python
from datasets import load_dataset

# Load DOI-to-repo links
links = load_dataset("cometadata/arxiv-software-repo-links", "links")

# Load repo citation stats
repo_stats = load_dataset("cometadata/arxiv-software-repo-links", "repo_stats")

# Load user/org citation stats
user_stats = load_dataset("cometadata/arxiv-software-repo-links", "user_stats")

# Load co-citation data
repo_cocitations = load_dataset("cometadata/arxiv-software-repo-links", "repo_cocitations")
user_cocitations = load_dataset("cometadata/arxiv-software-repo-links", "user_cocitations")

# Load cluster data
repo_clusters = load_dataset("cometadata/arxiv-software-repo-links", "repo_clusters")
user_clusters = load_dataset("cometadata/arxiv-software-repo-links", "user_clusters")
```
## Dataset Description
This dataset contains links between arXiv papers and software repositories (primarily GitHub), extracted from the full text and validated through multiple methods, including API checks and repository metadata analysis.
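To make the extraction step concrete, here is a minimal sketch of pulling GitHub repository URLs out of a passage of text. The regular expression and the `extract_repo_urls` helper are illustrative assumptions, not the patterns used by the actual extraction tool, which may cover more hosts and edge cases.

```python
import re
from typing import List

# Hypothetical pattern: github.com/<owner>/<repo>, with optional scheme and "www.".
# The real pipeline's patterns are not documented here and may differ.
GITHUB_URL = re.compile(
    r"(?:https?://)?(?:www\.)?github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)",
    re.IGNORECASE,
)

def extract_repo_urls(text: str) -> List[str]:
    """Return normalized GitHub repository URLs in order of first appearance."""
    found: List[str] = []
    for owner, repo in GITHUB_URL.findall(text):
        repo = repo.rstrip(".")      # drop a sentence-ending period
        if repo.endswith(".git"):    # drop a clone-style suffix
            repo = repo[:-4]
        url = f"https://github.com/{owner}/{repo}"
        if url not in found:         # de-duplicate while keeping order
            found.append(url)
    return found

sample = (
    "Our code is available at https://github.com/owner/repo. "
    "We build on github.com/google/jax and https://github.com/owner/repo.git."
)
print(extract_repo_urls(sample))
```

Normalizing before de-duplicating means the same repository written with and without a `.git` suffix yields a single entry.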
## Configurations
| Config | Description | Records |
|--------|-------------|---------|
| `links` | DOI-to-repository mappings with relationship type | 438,572 |
| `repo_stats` | Aggregate citation counts per repository | 263,532 |
| `user_stats` | Aggregate citation counts per GitHub user/organization | 141,746 |
| `repo_cocitations` | Repository pairs cited together by the same papers | 458,671 |
| `user_cocitations` | GitHub user/org pairs cited together by the same papers | 324,533 |
| `repo_clusters` | Repository communities detected via Louvain clustering | 819 |
| `user_clusters` | GitHub user/org communities detected via Louvain clustering | 181 |
## Schema
**links**
```json
{"doi": "10.48550/arxiv.2308.11197", "repo_url": "https://github.com/owner/repo", "relation_type": "References"}
```
**repo_stats**
```json
{"repo_url": "https://github.com/owner/repo", "citation_count": 42}
```
**user_stats**
```json
{"github_user": "facebookresearch", "citation_count": 3847}
```
**repo_cocitations**
```json
{"repo_1": "https://github.com/google/jax", "repo_2": "https://github.com/google/flax", "cocitation_count": 144}
```
**user_cocitations**
```json
{"github_user_1": "facebookresearch", "github_user_2": "microsoft", "cocitation_count": 84}
```
**repo_clusters / user_clusters**
```json
{"cluster_id": 1, "size": 219, "top_members": ["kingoflolz/mesh-transformer-jax", "huggingface/trl", "..."], "members": ["..."]}
```
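The aggregate configurations can in principle be recomputed from `links`. The sketch below shows one way to do that on a few in-memory records. It assumes that a citation is one distinct (DOI, repository) pair, and that user-level counts are distinct (DOI, owner) pairs; the dataset's exact counting rule is not stated, so treat both as assumptions. The sample rows are illustrative, not real dataset rows.

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative records in the `links` schema (the second DOI is a made-up placeholder).
links = [
    {"doi": "10.48550/arxiv.2308.11197", "repo_url": "https://github.com/owner/repo", "relation_type": "References"},
    {"doi": "10.48550/arxiv.2308.11197", "repo_url": "https://github.com/owner/other", "relation_type": "References"},
    {"doi": "10.48550/arxiv.0000.00000", "repo_url": "https://github.com/owner/repo", "relation_type": "IsSupplementedBy"},
    {"doi": "10.48550/arxiv.0000.00000", "repo_url": "https://github.com/google/jax", "relation_type": "References"},
]

def github_user(repo_url: str) -> str:
    """Return the owner segment of a GitHub repository URL."""
    return urlparse(repo_url).path.strip("/").split("/")[0]

# Assumption: count each (DOI, repository) pair once.
pairs = {(r["doi"], r["repo_url"]) for r in links}
repo_counts = Counter(url for _, url in pairs)

# Assumption: a paper citing two repos of the same owner counts once for that owner.
user_pairs = {(doi, github_user(url)) for doi, url in pairs}
user_counts = Counter(user for _, user in user_pairs)

repo_stats = [{"repo_url": u, "citation_count": c} for u, c in repo_counts.most_common()]
user_stats = [{"github_user": u, "citation_count": c} for u, c in user_counts.most_common()]
print(repo_stats[0], user_stats[0])
```

With the real data, the same logic could run over `load_dataset(..., "links")["train"]` in place of the sample list.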
## Relationship Types
| Type | Description | Count |
|------|-------------|-------|
| `References` | Paper mentions/cites the repository | 347,087 |
| `IsSupplementedBy` | Repository directly supports the paper (e.g., paper's code) | 91,485 |
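One signal used to promote a link to `IsSupplementedBy` is the paper's arXiv ID appearing in the repository README or description (see Methodology). A minimal sketch of that single check follows. It assumes DOIs use the `10.48550/arxiv.<id>` form shown in the schema; the helper names are hypothetical, and the real pipeline combines this with other evidence.

```python
import re
from typing import Optional

ARXIV_DOI_PREFIX = "10.48550/arxiv."

def arxiv_id_from_doi(doi: str) -> Optional[str]:
    """Return the arXiv identifier embedded in an arXiv DOI, or None."""
    doi = doi.strip().lower()
    if doi.startswith(ARXIV_DOI_PREFIX):
        return doi[len(ARXIV_DOI_PREFIX):]
    return None

def readme_mentions_paper(doi: str, readme_text: str) -> bool:
    """True if the README text contains the paper's arXiv ID (optionally versioned)."""
    arxiv_id = arxiv_id_from_doi(doi)
    if arxiv_id is None:
        return False
    # Guards on both sides avoid matching the ID inside a longer number.
    pattern = re.compile(
        rf"(?<![\d.]){re.escape(arxiv_id)}(?:v\d+)?(?!\d)", re.IGNORECASE
    )
    return pattern.search(readme_text) is not None

def promote(link: dict, readme_text: str) -> dict:
    """Hypothetical promotion step based on the README signal alone."""
    if link["relation_type"] == "References" and readme_mentions_paper(link["doi"], readme_text):
        return {**link, "relation_type": "IsSupplementedBy"}
    return link

link = {
    "doi": "10.48550/arxiv.2308.11197",
    "repo_url": "https://github.com/owner/repo",
    "relation_type": "References",
}
print(promote(link, "Official code for https://arxiv.org/abs/2308.11197v2")["relation_type"])
```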
## Dataset Statistics
| Metric | Value |
|--------|-------|
| Total paper-repo links | 438,572 |
| Unique repositories | 263,532 |
| Unique GitHub users/orgs | 141,746 |
| Unique arXiv DOIs | ~283,000 |
| Repo co-citation pairs | 458,671 |
| User co-citation pairs | 324,533 |
| Repo clusters | 819 |
| User clusters | 181 |
### Repository Citation Distribution
| Citations | Repositories |
|-----------|--------------|
| 1 | 214,768 |
| 2-5 | 41,482 |
| 6-10 | 4,331 |
| 11-50 | 2,651 |
| 51-100 | 194 |
| 100+ | 106 |
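A distribution like the one above can be rebuilt from `repo_stats` by bucketing `citation_count`. The sketch below assumes that a count of exactly 100 belongs to the `51-100` bucket and that `100+` means more than 100, since the table's labels leave that boundary ambiguous. The sample counts are illustrative.

```python
from bisect import bisect_left
from collections import Counter

# Inclusive upper bound of each closed bucket; the last label is open-ended.
UPPER_BOUNDS = [1, 5, 10, 50, 100]
LABELS = ["1", "2-5", "6-10", "11-50", "51-100", "100+"]

def bucket(citation_count: int) -> str:
    """Map a citation count (>= 1) to its bucket label."""
    return LABELS[bisect_left(UPPER_BOUNDS, citation_count)]

# Illustrative counts; with the real data, use the `citation_count` column of `repo_stats`.
counts = [1, 1, 3, 5, 6, 42, 100, 646]
distribution = Counter(bucket(c) for c in counts)
for label in LABELS:
    print(label, distribution[label])
```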
### Most Cited Repositories
| Repository | Citations |
|------------|-----------|
| [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax) | 646 |
| [pytorch/vision](https://github.com/pytorch/vision) | 601 |
| [huggingface/trl](https://github.com/huggingface/trl) | 570 |
| [openai/baselines](https://github.com/openai/baselines) | 552 |
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 429 |
### Most Cited GitHub Users/Organizations
| User/Org | Citations |
|----------|-----------|
| [facebookresearch](https://github.com/facebookresearch) | 3,847 |
| [microsoft](https://github.com/microsoft) | 2,583 |
| [google](https://github.com/google) | 2,389 |
| [huggingface](https://github.com/huggingface) | 1,796 |
| [openai](https://github.com/openai) | 1,476 |
| [pytorch](https://github.com/pytorch) | 1,283 |
| [NVIDIA](https://github.com/NVIDIA) | 1,213 |
| [google-research](https://github.com/google-research) | 1,005 |
| [allenai](https://github.com/allenai) | 900 |
| [open-mmlab](https://github.com/open-mmlab) | 890 |
### Top Co-citation Pairs (Repos)
| Repo 1 | Repo 2 | Co-citations |
|--------|--------|--------------|
| MarekKowalski/FaceSwap | deepfakes/faceswap | 161 |
| Significant-Gravitas/Auto-GPT | yoheinakajima/babyagi | 61 |
| marcotcr/lime | slundberg/shap | 50 |
| lofar-astron/prefactor | mhardcastle/ddf-pipeline | 46 |
| deepfakes/faceswap | iperov/DeepFaceLab | 45 |
### Top Co-citation Pairs (Users/Orgs)
| User 1 | User 2 | Co-citations |
|--------|--------|--------------|
| MarekKowalski | deepfakes | 162 |
| facebookresearch | microsoft | 84 |
| facebookresearch | pytorch | 78 |
| facebookresearch | google | 62 |
| facebookresearch | google-research | 62 |
### Example Repository Clusters
| Cluster | Theme | Top Members |
|---------|-------|-------------|
| 1 | LLMs/Transformers | kingoflolz/mesh-transformer-jax, huggingface/trl, EleutherAI/lm-evaluation-harness |
| 2 | Astronomy/Bayesian | SheffieldML/GPy, dfm/emcee, sczesla/PyAstronomy |
| 3 | Graph Neural Networks | pyg-team/pytorch_geometric, tkipf/gcn, snap-stanford/ogb |
| 4 | Generative Models | deepfakes/faceswap, MarekKowalski/FaceSwap, CompVis/stable-diffusion |
| 5 | Deep Learning Frameworks | pytorch/pytorch, onnx/onnx, PyTorchLightning/pytorch-lightning |
| 6 | Speech/Audio | kaldi-asr/kaldi, resemble-ai/Resemblyzer, NVIDIA/NeMo |
| 7 | Blockchain/Formal Verification | bitcoin/bips, Z3Prover/z3, ethereum/go-ethereum |
### Example User/Org Clusters
| Cluster | Theme | Top Members |
|---------|-------|-------------|
| 1 | ML/CV Research | facebookresearch, openai, google-research, open-mmlab |
| 2 | Enterprise ML | microsoft, google, pytorch, tensorflow |
| 3 | NLP/Inference | huggingface, NVIDIA, allenai, kingoflolz |
| 4 | Graph Learning | snap-stanford, tkipf, DeepGraphLearning, pyg-team |
| 5 | Astrophysics | LSSTDESC, cosmodesi, simonsobs, esheldon |
| 6 | Blockchain | ethereum, bitcoin, ConsenSys, hyperledger |
## Methodology
This dataset was produced using [extract-software-repos](https://github.com/cometadata/extract-software-repos).
1. **Extraction**: Software repository URLs are extracted from the full text of arXiv papers using pattern matching
2. **Validation**: URLs are validated via `git ls-remote`, API checks, and HTTP verification
3. **Promotion**: Generic "References" relationships are promoted to "IsSupplementedBy" when evidence suggests the repository directly supports the paper, based on:
   - arXiv ID present in repository README or description
   - Repository name similarity to paper title
   - Author matching: Repository contributors matched to paper authors using the [evamxb/dev-author-em-clf](https://huggingface.co/evamxb/dev-author-em-clf) model from [sci-soft-models](https://doi.org/10.5281/zenodo.17401862)
4. **Co-citation Analysis**: Pairs of repositories/users cited by the same paper are counted
5. **Clustering**: Louvain community detection applied to co-citation networks to identify thematic clusters
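The co-citation step can be sketched as follows: group repositories by paper, then count every unordered pair that shares a paper. Ordering each pair lexicographically is an assumption made here so that a pair has a single key; the dataset's own ordering of `repo_1` and `repo_2` may differ. The DOIs in the sample are made-up placeholders.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Illustrative `links` records (placeholder DOIs, relation_type omitted for brevity).
links = [
    {"doi": "10.48550/arxiv.0000.00001", "repo_url": "https://github.com/google/jax"},
    {"doi": "10.48550/arxiv.0000.00001", "repo_url": "https://github.com/google/flax"},
    {"doi": "10.48550/arxiv.0000.00002", "repo_url": "https://github.com/google/jax"},
    {"doi": "10.48550/arxiv.0000.00002", "repo_url": "https://github.com/google/flax"},
    {"doi": "10.48550/arxiv.0000.00002", "repo_url": "https://github.com/owner/repo"},
    {"doi": "10.48550/arxiv.0000.00003", "repo_url": "https://github.com/google/jax"},
]

# Group the distinct repositories cited by each paper.
repos_by_doi = defaultdict(set)
for record in links:
    repos_by_doi[record["doi"]].add(record["repo_url"])

# Count each unordered pair once per paper that cites both repositories.
cocitations = Counter()
for repos in repos_by_doi.values():
    for repo_1, repo_2 in combinations(sorted(repos), 2):
        cocitations[(repo_1, repo_2)] += 1

repo_cocitations = [
    {"repo_1": a, "repo_2": b, "cocitation_count": n}
    for (a, b), n in cocitations.most_common()
]
print(repo_cocitations[0])
```

The resulting weighted pairs form the edge list of the co-citation network. A Louvain implementation, for example `louvain_communities` in recent NetworkX releases, could then be run on that graph with `cocitation_count` as the edge weight.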
## Use Cases
- Analyzing software adoption in academic research
- Tracking research impact of open-source projects
- Identifying relationships between papers and codebases
- Building citation networks for software
- Understanding research community structure through co-citation patterns
- Discovering related repositories and research groups
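
As an illustration of the last use case, the sketch below ranks the repositories most often co-cited with a given one. The `related` helper is hypothetical, and the sample records reuse three pairs from the top co-citation table above, written as full URLs on the assumption that `repo_cocitations` stores URLs as in the schema example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Sample records built from the top co-citation table.
repo_cocitations = [
    {"repo_1": "https://github.com/MarekKowalski/FaceSwap", "repo_2": "https://github.com/deepfakes/faceswap", "cocitation_count": 161},
    {"repo_1": "https://github.com/marcotcr/lime", "repo_2": "https://github.com/slundberg/shap", "cocitation_count": 50},
    {"repo_1": "https://github.com/deepfakes/faceswap", "repo_2": "https://github.com/iperov/DeepFaceLab", "cocitation_count": 45},
]

def related(repo_url: str, records: List[dict], top_n: int = 5) -> List[Tuple[str, int]]:
    """Return up to top_n (repository, co-citation count) pairs, strongest first."""
    neighbors: Dict[str, int] = defaultdict(int)
    for r in records:
        # The target may appear on either side of a pair.
        if r["repo_1"] == repo_url:
            neighbors[r["repo_2"]] += r["cocitation_count"]
        elif r["repo_2"] == repo_url:
            neighbors[r["repo_1"]] += r["cocitation_count"]
    # Sort by descending count, then by URL for a stable order.
    return sorted(neighbors.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

print(related("https://github.com/deepfakes/faceswap", repo_cocitations))
```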
## License
CC0 1.0 Universal (Public Domain)
## Citation
```bibtex
@dataset{arxiv_software_repo_links,
title={arXiv Software Repository Links},
author={COMET Project},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/datasets/cometadata/arxiv-software-repo-links}
}
```
Provider: cometadata



