
Augmented Citation Graph and Hierarchical Community Labels for Scientific Paper Retrieval Evaluation

DataCite Commons | Updated 2026-05-06 | Indexed 2026-05-07
Download link:
https://zenodo.org/doi/10.5281/zenodo.20046263
Description:
Augmented Citation Graph + Hierarchical Community Labels

Data accompanying an anonymous NeurIPS 2026 submission on retrieval methods for scientific paper recommendation. This deposit contains all large derived data artifacts required to reproduce the paper's main tables and figures.

What is included

- Augmented citation graph (`augmented_graph_v2.parquet`, 1.9 GB): a single weighted undirected graph over 3.58 M scientific papers with 153.18 M edges. Layers combined: direct citation (target→target), bibliographic coupling (Salton cosine over the full ~150 M-paper reference table), and co-citation (Salton cosine).
- Bibliographic-coupling and co-citation edges, also provided separately for sensitivity analysis (`bc_edges_full.parquet`, 1.3 GB; `cc_edges_full.parquet`, 256 MB).
- Direct target↔target citation graph (`citation_graph.parquet`, 96 MB).
- Target corpus (`target/neurips_4m.parquet`, 1.8 GB): 4 M scientific papers spanning 8 domains (biology, biomedical, chemistry, computer science, engineering, environmental_earth, materials science, physics) with title, abstract, and publication-year metadata.
- Hierarchical community labels (`communities_augmented_v2/`, 270 MB):
    - Level 1 (sub-field): Leiden CPM γ = 1e-4 over the full graph, 73,477 communities.
    - Level 2 (research agenda): Leiden CPM γ = 1e-2 inside each Level-1 community's induced subgraph, 328,738 fine-grained communities (median non-singleton size 7, max 1,712).
    - Plus eight γ-sweep parquets (1e-6 .. 1e-2) for resolution sensitivity.
- Duplicate clusters (`dedup_v2_clusters.json`): 7,287 boilerplate-filtered duplicate clusters used to drop non-canonical papers from the evaluation pool.
- A `README.md` with full schemas and column descriptions.

What is NOT included (and why)

- Pre-computed embedding parquets (qwen3 0.6B, qwen3 8B, SPECTER2, Gemini text-embedding) total ~62 GB and exceed the Zenodo free-tier limit. They are reproducible from the open-source models and APIs by running `code/embeddings/embed_*.py` in the accompanying code repository.
- Raw `paper_reference` table for BC/CC re-derivation. Equivalent data is publicly available from the OpenAlex bulk export (`works.referenced_works`); we used the 2026-03-31 snapshot.

Reproducibility

The accompanying anonymous code repository (URL in the paper appendix) contains all scripts required to:

1. Recompute BC and CC edges from a `paper_reference` table (DuckDB out-of-core, ~3 minutes)
2. Build the augmented graph (~5 minutes)
3. Run the Leiden CPM γ sweep + hierarchical Level-2 step (~3 hours total on 22 vCPU)
4. Generate paper embeddings with each of the four target models
5. Run the 80-query retrieval benchmark, hybrid analysis (citation rerank, RRF), and lexical-divergence diagnostics

Schemas of every file in this deposit are documented in `README.md` inside the zip. All ID columns are 64-bit integer paper identifiers (OpenAlex-style).

Citation

Anonymous NeurIPS 2026 submission. Please cite this deposit by its DOI.

License

CC-BY-4.0
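For readers unfamiliar with the Salton cosine weighting named for the bibliographic-coupling and co-citation layers, the following is a minimal in-memory sketch of the standard formula, weight = |R_a ∩ R_b| / sqrt(|R_a| · |R_b|). It is not the deposit's DuckDB pipeline: the function name `bibliographic_coupling`, the dictionary input, and the toy IDs are assumptions made for illustration, and any thresholds or filters the authors apply are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def bibliographic_coupling(references):
    """Return {(a, b): weight} for every pair of papers sharing a reference.

    `references` (hypothetical input shape) maps a paper ID to the set of
    paper IDs it cites. The weight is the Salton cosine:
    |R_a & R_b| / sqrt(|R_a| * |R_b|).
    """
    # Invert the table: cited paper -> set of papers that cite it.
    citing = defaultdict(set)
    for paper, refs in references.items():
        for ref in refs:
            citing[ref].add(paper)

    # Each cited paper contributes one shared reference to every pair
    # of papers that cite it.
    shared = defaultdict(int)
    for papers in citing.values():
        for a, b in combinations(sorted(papers), 2):
            shared[(a, b)] += 1

    # Normalise the shared count by the geometric mean of list lengths.
    return {
        (a, b): n / sqrt(len(references[a]) * len(references[b]))
        for (a, b), n in shared.items()
    }


# Toy example: papers 1 and 2 share two references, papers 2 and 3 share one.
refs = {1: {10, 11, 12}, 2: {11, 12, 13}, 3: {13}}
edges = bibliographic_coupling(refs)
```

Under the same assumptions, co-citation weights follow from the same function by passing the transposed table (each paper mapped to the set of papers that cite it).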
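The hybrid analysis mentioned in the reproducibility steps includes RRF (reciprocal rank fusion). As a rough sketch of how two ranked lists, for example an embedding ranking and a citation-based ranking, could be fused, the snippet below scores each paper by the sum of 1 / (k + rank) over the lists in which it appears. The function name, the toy paper IDs, and k = 60 (a commonly used default) are assumptions; the deposit does not state the constant or the exact inputs used in the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of paper IDs into a single ranking.

    Each paper scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. k=60 is an assumed default, not a value
    taken from the deposit.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, paper_id in enumerate(ranking, start=1):
            scores[paper_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by paper ID for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))


# Hypothetical top-3 lists from two retrievers for one query.
embedding_ranking = [101, 102, 103]
citation_ranking = [103, 101, 104]
fused = reciprocal_rank_fusion([embedding_ranking, citation_ranking])
```

Papers that appear near the top of both lists rise in the fused order, while a paper found by only one retriever is kept but ranked lower.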
Provider:
Zenodo
Created:
2026-05-06