Augmented Citation Graph and Hierarchical Community Labels for Scientific Paper Retrieval Evaluation
Source: DataCite Commons. Updated 2026-05-06; indexed 2026-05-07.
Download link:
https://zenodo.org/doi/10.5281/zenodo.20046263
Description:
Augmented Citation Graph + Hierarchical Community Labels
Data accompanying an anonymous NeurIPS 2026 submission on retrieval methods for scientific paper recommendation. This deposit contains all large derived data artifacts required to reproduce the paper's main tables and figures.
What is included
- Augmented citation graph (`augmented_graph_v2.parquet`, 1.9 GB): a single weighted undirected graph over 3.58 M scientific papers with 153.18 M edges. Layers combined: direct citation (target→target), bibliographic coupling (Salton cosine over the full ~150 M-paper reference table), and co-citation (Salton cosine).
- Bibliographic-coupling and co-citation edges, also provided separately for sensitivity analysis (`bc_edges_full.parquet`, 1.3 GB; `cc_edges_full.parquet`, 256 MB).
- Direct target↔target citation graph (`citation_graph.parquet`, 96 MB).
- Target corpus (`target/neurips_4m.parquet`, 1.8 GB): 4 M scientific papers spanning 8 domains (biology, biomedical, chemistry, computer science, engineering, environmental_earth, materials science, physics) with title, abstract, and publication-year metadata.
- Hierarchical community labels (`communities_augmented_v2/`, 270 MB):
  - Level 1 (sub-field): Leiden CPM γ = 1e-4 over the full graph, 73,477 communities.
  - Level 2 (research agenda): Leiden CPM γ = 1e-2 inside each Level-1 community's induced subgraph, 328,738 fine-grained communities (median non-singleton size 7, max 1,712).
  - Plus eight γ-sweep parquets (1e-6 .. 1e-2) for resolution sensitivity.
- Duplicate clusters (`dedup_v2_clusters.json`): 7,287 boilerplate-filtered duplicate clusters used to drop non-canonical papers from the evaluation pool.
- A `README.md` with full schemas and column descriptions.
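To make the bibliographic-coupling layer concrete, here is a minimal sketch of the standard Salton cosine on a toy reference table: the number of shared references divided by the geometric mean of the two reference-list lengths. The function name `salton_coupling` and the toy IDs are illustrative assumptions, not taken from the deposit. Any thresholds or filters the authors applied are documented in the deposit's `README.md` and are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def salton_coupling(references):
    """Return {(a, b): weight} with a < b for every pair sharing >= 1 reference.

    `references` maps a paper ID to the IDs it cites (hypothetical input shape).
    weight = |R_a ∩ R_b| / sqrt(|R_a| * |R_b|)
    """
    ref_sets = {paper: set(refs) for paper, refs in references.items()}

    # Invert the table: cited ID -> set of papers that cite it.
    cited_by = defaultdict(set)
    for paper, refs in ref_sets.items():
        for ref in refs:
            cited_by[ref].add(paper)

    # Each shared reference contributes 1 to every pair of its citers.
    shared = defaultdict(int)
    for citers in cited_by.values():
        for a, b in combinations(sorted(citers), 2):
            shared[(a, b)] += 1

    return {
        (a, b): count / sqrt(len(ref_sets[a]) * len(ref_sets[b]))
        for (a, b), count in shared.items()
    }


# Toy example: papers 1 and 2 share two of their four references each.
toy_refs = {1: [10, 11, 12, 13], 2: [10, 11, 14, 15], 3: [16]}
bc = salton_coupling(toy_refs)
print(bc)  # {(1, 2): 0.5}
```

The same function yields co-citation weights if the input maps each paper to the set of papers citing it instead of the papers it cites. At the scale described above, enumerating pairs in memory would not be feasible. The sketch only illustrates the weighting.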
What is NOT included (and why)
- Pre-computed embedding parquets (qwen3 0.6B, qwen3 8B, SPECTER2, Gemini text-embedding) total ~62 GB and exceed the Zenodo free-tier limit. They are reproducible from the open-source models and APIs by running `code/embeddings/embed_*.py` in the accompanying code repository.
- Raw `paper_reference` table for BC/CC re-derivation. Equivalent data is publicly available from the OpenAlex bulk export (`works.referenced_works`); we used the 2026-03-31 snapshot.
Reproducibility
The accompanying anonymous code repository (URL in the paper appendix) contains all scripts required to:
1. Recompute BC and CC edges from a `paper_reference` table (DuckDB out-of-core, ~3 minutes)
2. Build the augmented graph (~5 minutes)
3. Run the Leiden CPM γ sweep + hierarchical Level-2 step (~3 hours total on 22 vCPU)
4. Generate paper embeddings with each of the four target models
5. Run the 80-query retrieval benchmark, hybrid analysis (citation rerank, RRF), and lexical-divergence diagnostics
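For readers unfamiliar with the RRF step in item 5, the following is a minimal sketch of reciprocal rank fusion in its common form: each document scores the sum of 1 / (k + rank) over the input rankings. The constant `k = 60` is the conventional default and an assumption here, as are the function name and the toy paper IDs. The settings actually used in the benchmark are defined in the accompanying code repository.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of paper IDs into one list.

    score(d) = sum over rankings of 1 / (k + rank of d), with ranks starting at 1.
    Documents missing from a ranking contribute nothing for that ranking.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Toy example: one ranking from an embedding model, one from citation signals.
dense_ranking = [101, 102, 103]
citation_ranking = [103, 101, 104]
fused = reciprocal_rank_fusion([dense_ranking, citation_ranking])
print(fused)  # [101, 103, 102, 104]
```

Because RRF uses only ranks, it needs no score calibration between the embedding-based and citation-based retrievers.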
Schemas of every file in this deposit are documented in `README.md` inside the zip. All ID columns are 64-bit integer paper identifiers (OpenAlex-style).
Citation
Anonymous NeurIPS 2026 submission. Please cite this deposit by its DOI.
License
CC-BY-4.0
Provider:
Zenodo
Created:
2026-05-06



