
cometadata/arxiv-software-repo-links

Source: Hugging Face. Updated 2026-01-08; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/cometadata/arxiv-software-repo-links
Resource description:
---
license: cc0-1.0
task_categories:
- text-classification
language:
- en
tags:
- arxiv
- software
- github
- code
- research
- citations
- co-citation
- clustering
size_categories:
- 100K<n<1M
configs:
- config_name: links
  data_files:
  - split: train
    path: data/links/train.jsonl
- config_name: repo_stats
  data_files:
  - split: train
    path: data/repo_stats/train.jsonl
- config_name: user_stats
  data_files:
  - split: train
    path: data/user_stats/train.jsonl
- config_name: repo_cocitations
  data_files:
  - split: train
    path: data/repo_cocitations/train.jsonl
- config_name: user_cocitations
  data_files:
  - split: train
    path: data/user_cocitations/train.jsonl
- config_name: repo_clusters
  data_files:
  - split: train
    path: data/repo_clusters/train.jsonl
- config_name: user_clusters
  data_files:
  - split: train
    path: data/user_clusters/train.jsonl
---

# arXiv Software Repository Links

A dataset mapping arXiv papers (via DOI) to software repositories they reference or are supplemented by, with co-citation analysis and community clustering.

## Quick Start

```python
from datasets import load_dataset

# Load DOI-to-repo links
links = load_dataset("cometadata/arxiv-software-repo-links", "links")

# Load repo citation stats
repo_stats = load_dataset("cometadata/arxiv-software-repo-links", "repo_stats")

# Load user/org citation stats
user_stats = load_dataset("cometadata/arxiv-software-repo-links", "user_stats")

# Load co-citation data
repo_cocitations = load_dataset("cometadata/arxiv-software-repo-links", "repo_cocitations")
user_cocitations = load_dataset("cometadata/arxiv-software-repo-links", "user_cocitations")

# Load cluster data
repo_clusters = load_dataset("cometadata/arxiv-software-repo-links", "repo_clusters")
user_clusters = load_dataset("cometadata/arxiv-software-repo-links", "user_clusters")
```

## Dataset Description

This dataset contains links between arXiv papers and software repositories (primarily GitHub), extracted from the full text and validated through multiple methods, including API checks and repository metadata analysis.
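
As a rough illustration of the pattern-matching extraction mentioned above, the sketch below pulls GitHub repository URLs out of a block of text and emits records shaped like the `links` config. The regular expression, the normalization rules, and the helper names `extract_repo_urls` and `to_link_records` are assumptions made for this example; the card does not specify the patterns the actual extraction tool uses, and no validation step is shown.

```python
import re
from typing import Dict, List

# Hypothetical pattern (an assumption, not the dataset's actual rule):
# matches github.com/<owner>/<repo>, with or without a scheme.
GITHUB_REPO_RE = re.compile(
    r"(?:https?://)?(?:www\.)?github\.com/"
    r"([A-Za-z0-9][A-Za-z0-9-]*)/([A-Za-z0-9._-]+)",
    re.IGNORECASE,
)


def extract_repo_urls(text: str) -> List[str]:
    """Return normalized GitHub repo URLs in order of first appearance."""
    seen = set()
    urls = []
    for owner, repo in GITHUB_REPO_RE.findall(text):
        repo = repo.rstrip(".")      # drop sentence-ending punctuation
        if repo.endswith(".git"):    # drop a clone-URL suffix
            repo = repo[:-4]
        if not repo:
            continue
        url = f"https://github.com/{owner}/{repo}"
        key = url.lower()            # de-duplicate case-insensitively
        if key not in seen:
            seen.add(key)
            urls.append(url)
    return urls


def to_link_records(doi: str, text: str) -> List[Dict[str, str]]:
    """Build `links`-style records; every link starts as a generic reference."""
    return [
        {"doi": doi, "repo_url": url, "relation_type": "References"}
        for url in extract_repo_urls(text)
    ]


sample_text = (
    "Code is at https://github.com/owner/repo. We also use "
    "github.com/google/jax, and https://github.com/owner/repo.git again."
)
records = to_link_records("10.48550/arxiv.2308.11197", sample_text)
```

Here `records` holds one entry per distinct repository found in `sample_text`, each tagged `References`. A real pipeline would still need to check that each URL resolves to an existing repository.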

## Configurations

| Config | Description | Records |
|--------|-------------|---------|
| `links` | DOI-to-repository mappings with relationship type | 438,572 |
| `repo_stats` | Aggregate citation counts per repository | 263,532 |
| `user_stats` | Aggregate citation counts per GitHub user/organization | 141,746 |
| `repo_cocitations` | Repository pairs cited together by the same papers | 458,671 |
| `user_cocitations` | GitHub user/org pairs cited together by the same papers | 324,533 |
| `repo_clusters` | Repository communities detected via Louvain clustering | 819 |
| `user_clusters` | GitHub user/org communities detected via Louvain clustering | 181 |

## Schema

**links**

```json
{"doi": "10.48550/arxiv.2308.11197", "repo_url": "https://github.com/owner/repo", "relation_type": "References"}
```

**repo_stats**

```json
{"repo_url": "https://github.com/owner/repo", "citation_count": 42}
```

**user_stats**

```json
{"github_user": "facebookresearch", "citation_count": 3847}
```

**repo_cocitations**

```json
{"repo_1": "https://github.com/google/jax", "repo_2": "https://github.com/google/flax", "cocitation_count": 144}
```

**user_cocitations**

```json
{"github_user_1": "facebookresearch", "github_user_2": "microsoft", "cocitation_count": 84}
```

**repo_clusters / user_clusters**

```json
{"cluster_id": 1, "size": 219, "top_members": ["kingoflolz/mesh-transformer-jax", "huggingface/trl", "..."], "members": ["..."]}
```

## Relationship Types

| Type | Description | Count |
|------|-------------|-------|
| `References` | Paper mentions/cites the repository | 347,087 |
| `IsSupplementedBy` | Repository directly supports the paper (e.g., paper's code) | 91,485 |

## Dataset Statistics

| Metric | Value |
|--------|-------|
| Total paper-repo links | 438,572 |
| Unique repositories | 263,532 |
| Unique GitHub users/orgs | 141,746 |
| Unique arXiv DOIs | ~283,000 |
| Repo co-citation pairs | 458,671 |
| User co-citation pairs | 324,533 |
| Repo clusters | 819 |
| User clusters | 181 |

### Repository Citation Distribution

| Citations | Repositories |
|-----------|--------------|
| 1 | 214,768 |
| 2-5 | 41,482 |
| 6-10 | 4,331 |
| 11-50 | 2,651 |
| 51-100 | 194 |
| 100+ | 106 |

### Most Cited Repositories

| Repository | Citations |
|------------|-----------|
| [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax) | 646 |
| [pytorch/vision](https://github.com/pytorch/vision) | 601 |
| [huggingface/trl](https://github.com/huggingface/trl) | 570 |
| [openai/baselines](https://github.com/openai/baselines) | 552 |
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 429 |

### Most Cited GitHub Users/Organizations

| User/Org | Citations |
|----------|-----------|
| [facebookresearch](https://github.com/facebookresearch) | 3,847 |
| [microsoft](https://github.com/microsoft) | 2,583 |
| [google](https://github.com/google) | 2,389 |
| [huggingface](https://github.com/huggingface) | 1,796 |
| [openai](https://github.com/openai) | 1,476 |
| [pytorch](https://github.com/pytorch) | 1,283 |
| [NVIDIA](https://github.com/NVIDIA) | 1,213 |
| [google-research](https://github.com/google-research) | 1,005 |
| [allenai](https://github.com/allenai) | 900 |
| [open-mmlab](https://github.com/open-mmlab) | 890 |

### Top Co-citation Pairs (Repos)

| Repo 1 | Repo 2 | Co-citations |
|--------|--------|--------------|
| MarekKowalski/FaceSwap | deepfakes/faceswap | 161 |
| Significant-Gravitas/Auto-GPT | yoheinakajima/babyagi | 61 |
| marcotcr/lime | slundberg/shap | 50 |
| lofar-astron/prefactor | mhardcastle/ddf-pipeline | 46 |
| deepfakes/faceswap | iperov/DeepFaceLab | 45 |

### Top Co-citation Pairs (Users/Orgs)

| User 1 | User 2 | Co-citations |
|--------|--------|--------------|
| MarekKowalski | deepfakes | 162 |
| facebookresearch | microsoft | 84 |
| facebookresearch | pytorch | 78 |
| facebookresearch | google | 62 |
| facebookresearch | google-research | 62 |

### Example Repository Clusters

| Cluster | Theme | Top Members |
|---------|-------|-------------|
| 1 | LLMs/Transformers | kingoflolz/mesh-transformer-jax, huggingface/trl, EleutherAI/lm-evaluation-harness |
| 2 | Astronomy/Bayesian | SheffieldML/GPy, dfm/emcee, sczesla/PyAstronomy |
| 3 | Graph Neural Networks | pyg-team/pytorch_geometric, tkipf/gcn, snap-stanford/ogb |
| 4 | Generative Models | deepfakes/faceswap, MarekKowalski/FaceSwap, CompVis/stable-diffusion |
| 5 | Deep Learning Frameworks | pytorch/pytorch, onnx/onnx, PyTorchLightning/pytorch-lightning |
| 6 | Speech/Audio | kaldi-asr/kaldi, resemble-ai/Resemblyzer, NVIDIA/NeMo |
| 7 | Blockchain/Formal Verification | bitcoin/bips, Z3Prover/z3, ethereum/go-ethereum |

### Example User/Org Clusters

| Cluster | Theme | Top Members |
|---------|-------|-------------|
| 1 | ML/CV Research | facebookresearch, openai, google-research, open-mmlab |
| 2 | Enterprise ML | microsoft, google, pytorch, tensorflow |
| 3 | NLP/Inference | huggingface, NVIDIA, allenai, kingoflolz |
| 4 | Graph Learning | snap-stanford, tkipf, DeepGraphLearning, pyg-team |
| 5 | Astrophysics | LSSTDESC, cosmodesi, simonsobs, esheldon |
| 6 | Blockchain | ethereum, bitcoin, ConsenSys, hyperledger |

## Methodology

This dataset was produced using [extract-software-repos](https://github.com/cometadata/extract-software-repos).

1. **Extraction**: Software repository URLs extracted from arXiv paper full-texts using pattern matching
2. **Validation**: URLs validated via git ls-remote, API checks, and HTTP verification
3. **Promotion**: Generic "References" relationships promoted to "IsSupplementedBy" when evidence suggests the repository directly supports the paper, based on:
   - arXiv ID present in repository README or description
   - Repository name similarity to paper title
   - Author matching: Repository contributors matched to paper authors using the [evamxb/dev-author-em-clf](https://huggingface.co/evamxb/dev-author-em-clf) model from [sci-soft-models](https://doi.org/10.5281/zenodo.17401862)
4. **Co-citation Analysis**: Pairs of repositories/users cited by the same paper are counted
5. **Clustering**: Louvain community detection applied to co-citation networks to identify thematic clusters

## Use Cases

- Analyzing software adoption in academic research
- Tracking research impact of open-source projects
- Identifying relationships between papers and codebases
- Building citation networks for software
- Understanding research community structure through co-citation patterns
- Discovering related repositories and research groups

## License

CC0 1.0 Universal (Public Domain)

## Citation

```bibtex
@dataset{arxiv_software_repo_links,
  title={arXiv Software Repository Links},
  author={COMET Project},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/cometadata/arxiv-software-repo-links}
}
```
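
The co-citation step described in the Methodology can be sketched with the standard library alone. The example below starts from records shaped like the `links` config and derives per-repository citation counts plus repository and user/org co-citation counts. Several details are assumptions made for illustration: the sample DOIs are made-up placeholders, a citation is counted once per distinct DOI, pairs are ordered alphabetically, and the user/org is taken to be the first path segment of the repository URL. The card does not state how the published files handle these points.

```python
from collections import Counter, defaultdict
from itertools import combinations
from urllib.parse import urlparse

# Illustrative records in the `links` schema; the DOIs are placeholders.
sample_links = [
    {"doi": "10.48550/arxiv.0000.00001", "repo_url": "https://github.com/google/jax"},
    {"doi": "10.48550/arxiv.0000.00001", "repo_url": "https://github.com/google/flax"},
    {"doi": "10.48550/arxiv.0000.00002", "repo_url": "https://github.com/google/jax"},
    {"doi": "10.48550/arxiv.0000.00002", "repo_url": "https://github.com/google/flax"},
    {"doi": "10.48550/arxiv.0000.00002", "repo_url": "https://github.com/pytorch/pytorch"},
    {"doi": "10.48550/arxiv.0000.00003", "repo_url": "https://github.com/pytorch/pytorch"},
]


def github_user(repo_url):
    """Assumed rule: the owner is the first path segment of the repo URL."""
    return urlparse(repo_url).path.strip("/").split("/")[0]


def group_by_doi(links, key=lambda url: url):
    """Map each DOI to the set of items (repos or users) it cites."""
    items_by_doi = defaultdict(set)
    for record in links:
        items_by_doi[record["doi"]].add(key(record["repo_url"]))
    return items_by_doi


def citation_counts(items_by_doi):
    """Count the number of distinct DOIs citing each item."""
    counts = Counter()
    for items in items_by_doi.values():
        counts.update(items)
    return counts


def cocitation_counts(items_by_doi):
    """Count how many DOIs cite both members of each unordered pair."""
    pairs = Counter()
    for items in items_by_doi.values():
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(items), 2):
            pairs[(a, b)] += 1
    return pairs


repos_by_doi = group_by_doi(sample_links)
users_by_doi = group_by_doi(sample_links, key=github_user)

repo_counts = citation_counts(repos_by_doi)
repo_pairs = cocitation_counts(repos_by_doi)
user_pairs = cocitation_counts(users_by_doi)
```

The resulting pair counts form a weighted edge list, which is the kind of input a Louvain implementation expects (for example `louvain_communities` in NetworkX); that step is left out here to keep the sketch dependency-free.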
Provider:
cometadata