lwaekfjlk/artifact-bench

Name: lwaekfjlk/artifact-bench
Creator: lwaekfjlk
Published: 2026-04-21 06:31:01
License: CC BY 4.0 (`cc-by-4.0`, per the dataset card metadata)

Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/lwaekfjlk/artifact-bench


Description:

---
license: cc-by-4.0
language:
- en
tags:
- graph
- link-prediction
- benchmark
- model-dataset
size_categories:
- 10K<n<100K
---

# Artifact Graph

A heterogeneous graph of HuggingFace model/dataset/paper/codebase nodes with observed (model, dataset, performance-metric) evaluation edges, used to benchmark link prediction and attribute regression.

## Contents

| path | description |
|------|-------------|
| `full/` | Full unsplit graph: all nodes + all edges (by type) |
| `transductive/` | All nodes visible in both train and test; edges split |
| `inductive/` | Disjoint node partition: some nodes train-only, others test-only |

## Full graph (`full/`)

| file | description |
|------|-------------|
| `node_metadata.json` | Per-node `{type, name, downloads, info}` for all 14K nodes |
| `node_mappings.json` | Integer ID ↔ HuggingFace ID mapping |
| `node_embeddings_voyage.npy` | Voyage-3 embeddings, `(N, 1024)` |
| `node_embeddings_random.npy` | L2-normalised random embeddings |
| `edges.npz` | All edges combined, `(2, E)` |
| `edges_eval.npz` | model × dataset evaluation edges |
| `edges_base_model.npz` | model → base_model edges |
| `edges_resource.npz` | model/dataset → paper/codebase edges |
| `edge_metadata.json` | Raw (model, dataset, metric) edge records |
| `edge_metadata_normalized.json` | Eval edges with metrics normalised to `[0, 1]` |
| `edge_metadata_eval.json` | Eval-edge metadata only |
| `edge_metadata_base_model.json` | Base-model edge metadata |
| `edge_metadata_resource.json` | Paper / codebase resource edge metadata |

Each split directory (`transductive/`, `inductive/`) contains:

| file | description |
|------|-------------|
| `node_embeddings_voyage.npy` | Voyage-3 embeddings, shape `(N, 1024)` |
| `node_embeddings_random.npy` | L2-normalised random embeddings, same shape |
| `split_info.json` | Split metadata (seed, counts, dates) |
| `node_split.json` (inductive only) | Per-node train/test assignment |
| `train_split/` | Training subgraph (see below) |
| `test_split/` | Test subgraph (held-out eval edges) |

Each `{train,test}_split/` holds:

| file | description |
|------|-------------|
| `node_metadata.json` | Per-node `{type, name, downloads, info}` |
| `edge_metadata_normalized.json` | Normalized `(u,v) → metric: value` map |
| `edges.npz` | Message-passing edges, `edges` key, shape `(2, E)` |
| `pos_edges.npz` | Positive eval edges (model × dataset with metric) |

## Node types

- `model`: HuggingFace models (e.g., `sileod/deberta-v3-large-tasksource-nli`)
- `dataset`: HuggingFace datasets (e.g., `nyu-mll/multi_nli`)
- `paper`: referenced papers (arXiv IDs)
- `codebase`: linked repositories

## Edge types

- `model ↔ dataset` (eval): accuracy / F1 / BLEU / etc. (normalized to `[0, 1]`)
- `model ↔ paper`, `model ↔ codebase`, `dataset ↔ paper`, `dataset ↔ codebase`: resource links
- `model ↔ model`: base-model / fine-tune relations

## Usage

```python
import json

import numpy as np
from huggingface_hub import snapshot_download

# Download a local snapshot of the dataset repository.
path = snapshot_download("lwaekfjlk/artifact-bench", repo_type="dataset")

emb = np.load(f"{path}/transductive/node_embeddings_voyage.npy")  # node embeddings
with open(f"{path}/transductive/train_split/node_metadata.json") as f:
    nm = json.load(f)  # per-node metadata
pe = np.load(f"{path}/transductive/train_split/pos_edges.npz")["edges"]  # positive eval edges
print(emb.shape, len(nm), pe.shape)
```

## Case study: NLI (`case_study_nli/`)

Frozen 576-cell evaluation grid used as the NLI case study in our paper: 48 NLI models × 12 NLI datasets. Each cell was produced by an LLM-coder pipeline that emitted per-example predictions and a top-level accuracy.
### Layout

| path | description |
|---|---|
| `case_study_nli/raw_evals/<model>_<dataset>_accuracy/predictions.json` | Per-example `{idx, prediction, ground_truth}` |
| `case_study_nli/raw_evals/<model>_<dataset>_accuracy/results.json` | `{accuracy: float}` (plus `previous_accuracy` for 9 bug cells) |
| `case_study_nli/all_results_summary_fixed.json` | Cleaned aggregate: 576 rows, 9 bug fixes applied, masked flags for cells that cannot be scored 3-way |
| `case_study_nli/scripts/rebuild_nli_summary.py` | Raw → fixed aggregate |
| `case_study_nli/scripts/plot_nli_heatmap.py` | 45-model heatmap (3 models with degenerate cells excluded by `--min-cell 0.05`) |
| `case_study_nli/scripts/plot_nli_matrix_scree.py` | Double-centered SVD scree plot |
| `case_study_nli/figures/nli_results_heatmap.{png,pdf}` | Main heatmap |
| `case_study_nli/figures/nli_matrix_scree.{png,pdf}` | Scree plot |

### Known issues in the raw evaluations

1. **9 bug-fix cells**: the top-level `accuracy` was overwritten with 0, but `previous_accuracy` (> 0) holds the real value. The fixed aggregate uses `previous_accuracy`.
2. **Binary-output models on 3-way datasets**: three zero-shot classifiers only emit 2 labels; their MNLI / SNLI / ANLI / NLI_FEVER cells are masked in the aggregate (not directly comparable to 3-way models).
3. **2 true failures**: `microsoft/deberta-v3-base` on `allenai/scitail` and `araag2/MedNLI` produced degenerate predictions.

### Reproducibility note

The per-cell evaluation scripts were not uniformly persisted to disk: cells run in the January batch retained them, but the April re-runs (the majority) executed inline via an agent and only wrote results back. We therefore ship just the frozen outputs (`predictions.json` + `results.json`) rather than an incomplete script set. The processing scripts in `case_study_nli/scripts/` are sufficient to regenerate the aggregate and figures from the per-cell outputs.
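As a rough illustration of known issue 1 and the per-cell file layout above, the sketch below picks the accuracy to use for one cell and recomputes accuracy from per-example records. The function names and the in-memory sample records are hypothetical, `predictions.json` is assumed to be a list of `{idx, prediction, ground_truth}` dicts, and this is not the logic of `rebuild_nli_summary.py`, which may differ.

```python
def effective_accuracy(results: dict) -> float:
    """Return the accuracy to use for one cell.

    Assumption: a bug cell is one whose top-level accuracy is 0 while
    previous_accuracy is present and positive.
    """
    acc = float(results.get("accuracy", 0.0))
    prev = results.get("previous_accuracy")
    if acc == 0.0 and prev is not None and float(prev) > 0.0:
        return float(prev)
    return acc


def accuracy_from_predictions(records: list) -> float:
    """Recompute accuracy from per-example records (assumed list of dicts)."""
    if not records:
        return 0.0
    correct = sum(1 for r in records if r["prediction"] == r["ground_truth"])
    return correct / len(records)


# Hypothetical in-memory stand-ins for results.json and predictions.json.
bug_cell = {"accuracy": 0.0, "previous_accuracy": 0.87}
ok_cell = {"accuracy": 0.91}
preds = [
    {"idx": 0, "prediction": 1, "ground_truth": 1},
    {"idx": 1, "prediction": 0, "ground_truth": 2},
    {"idx": 2, "prediction": 2, "ground_truth": 2},
    {"idx": 3, "prediction": 1, "ground_truth": 1},
]

print(effective_accuracy(bug_cell))
print(effective_accuracy(ok_cell))
print(accuracy_from_predictions(preds))
```

Comparing the recomputed value against the stored `accuracy` is one way to spot cells where the top-level number was overwritten.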
### Reproduce aggregate + figures

Run from the repository root:

```bash
pip install datasets numpy matplotlib huggingface_hub

python case_study_nli/scripts/rebuild_nli_summary.py \
  --src case_study_nli/raw_evals \
  --out case_study_nli/all_results_summary_fixed.json

python case_study_nli/scripts/plot_nli_heatmap.py \
  --input case_study_nli/all_results_summary_fixed.json \
  --out-dir case_study_nli/figures

python case_study_nli/scripts/plot_nli_matrix_scree.py \
  --input case_study_nli/all_results_summary_fixed.json \
  --out-dir case_study_nli/figures
```

## Verification bench (`verification_bench/`)

Full agent-based eval reproductions: 263 (model, dataset, metric) cells drawn from a stratified "hard" sample of the artifact graph. A skill-based multi-agent system (driver: GPT-5.2, tool mode: multiturn_metadatatool) attempts to reproduce each published accuracy score by locating the dataset, loading the model, writing an eval script, and reporting a metric.

### Layout

```
verification_bench/
└── skills_multiagent_gpt-5.2_metadatatool/
    └── <model>_<dataset>_<metric>/
        ├── metadata.json       # (model, dataset, metric) spec
        ├── run_eval.py         # agent-written evaluation script
        ├── predictions.json    # per-example predictions
        ├── results.json        # top-level metric value
        ├── run.log             # agent trajectory log
        └── results_full.json   # rich metric breakdown (4 cells only)
```

### Use

```python
import json
import os

ROOT = "verification_bench/skills_multiagent_gpt-5.2_metadatatool"
for cell in sorted(os.listdir(ROOT)):
    if not os.path.isfile(f"{ROOT}/{cell}/results.json"):
        continue  # failed cells may lack results.json
    with open(f"{ROOT}/{cell}/metadata.json") as f:
        meta = json.load(f)
    with open(f"{ROOT}/{cell}/results.json") as f:
        result = json.load(f)
    print(meta["model_id"], meta["dataset_id"], result)
```

### Notes

- 263 of the 266 cell directories contain a complete `results.json`; the remaining 3 failed with agent / runtime errors.
- Cell directories are named `<model>_<dataset>_<metric>`, with `/` replaced by `_` in HuggingFace IDs.
- This suite is the best-performing agent configuration we evaluated (156 cells above accuracy 0.5, 97 above 0.8); scores are normalised to `[0, 1]`.
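The naming rule in the notes above can be sketched as a small helper, together with the kind of threshold count used in the summary figures. Both `cell_dir_name` and `count_above` are hypothetical names, not part of the repository, and the scores are made up. Because HuggingFace IDs can themselves contain underscores, a directory name cannot always be split back into its parts, so `metadata.json` remains the reliable source for the (model, dataset, metric) triple.

```python
def cell_dir_name(model_id: str, dataset_id: str, metric: str) -> str:
    """Build a `<model>_<dataset>_<metric>` directory name.

    Hypothetical helper: replaces "/" with "_" in each part, as the notes describe.
    """
    parts = (model_id, dataset_id, metric)
    return "_".join(p.replace("/", "_") for p in parts)


def count_above(scores, threshold: float) -> int:
    """Count scores strictly above a threshold (scores assumed to lie in [0, 1])."""
    return sum(1 for s in scores if s > threshold)


# Illustrative pairing of IDs mentioned in this card; not necessarily a bench cell.
name = cell_dir_name(
    "sileod/deberta-v3-large-tasksource-nli", "nyu-mll/multi_nli", "accuracy"
)
print(name)

# Made-up scores, only to show the thresholding behind the summary counts.
toy_scores = [0.31, 0.55, 0.82, 0.90, 0.50]
print(count_above(toy_scores, 0.5), count_above(toy_scores, 0.8))
```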

