lwaekfjlk/artifact-bench
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/lwaekfjlk/artifact-bench
---
license: cc-by-4.0
language:
- en
tags:
- graph
- link-prediction
- benchmark
- model-dataset
size_categories:
- 10K<n<100K
---
# Artifact Graph
A heterogeneous graph of HuggingFace model/dataset/paper/codebase nodes with
observed (model, dataset, performance-metric) evaluation edges, used to
benchmark link prediction and attribute regression.
## Contents
| path | description |
|-----------------------------------|-------------|
| `full/` | Full unsplit graph: all nodes + all edges (by type) |
| `transductive/` | All nodes visible in both train and test; edges split |
| `inductive/` | Disjoint node partition: some nodes train-only, others test-only |
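
The difference between the two split styles can be illustrated on a toy edge list. This is only a sketch of the general idea, not the procedure used to build this dataset: the function names, the 20% test fraction, and the rule that any edge touching a test node becomes a test edge are all assumptions made for illustration.

```python
import random

def transductive_split(edges, test_frac=0.2, seed=0):
    """Hold out a fraction of edges; every node stays visible on both sides."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def inductive_split(edges, test_frac=0.2, seed=0):
    """Partition nodes first, then assign each edge by its endpoints."""
    rng = random.Random(seed)
    nodes = sorted({n for edge in edges for n in edge})
    rng.shuffle(nodes)
    test_nodes = set(nodes[: int(len(nodes) * test_frac)])
    # Train edges never touch a test node; all other edges go to test.
    train = [e for e in edges if not (set(e) & test_nodes)]
    test = [e for e in edges if set(e) & test_nodes]
    return train, test, test_nodes

# Toy bipartite graph: nodes 0-4 on one side, 5-9 on the other.
toy_edges = [(i, j) for i in range(5) for j in range(5, 10)]
train_t, test_t = transductive_split(toy_edges)
train_i, test_i, held_out_nodes = inductive_split(toy_edges)
print(len(train_t), len(test_t), len(train_i), len(test_i), sorted(held_out_nodes))
```

In the inductive case a model never sees the held-out nodes during training, so it has to rely on node features such as the shipped embeddings.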
## Full graph (`full/`)
| file | description |
|---------------------------------------|-------------|
| `node_metadata.json` | Per-node `{type, name, downloads, info}` for all 14K nodes |
| `node_mappings.json` | Integer ID ↔ HuggingFace ID mapping |
| `node_embeddings_voyage.npy` | Voyage-3 embeddings, `(N, 1024)` |
| `node_embeddings_random.npy` | L2-normalised random embeddings |
| `edges.npz` | All edges combined, `(2, E)` |
| `edges_eval.npz` | model × dataset evaluation edges |
| `edges_base_model.npz` | model → base_model edges |
| `edges_resource.npz` | model/dataset → paper/codebase edges |
| `edge_metadata.json` | Raw (model, dataset, metric) edge records |
| `edge_metadata_normalized.json` | Eval edges with metrics normalised to `[0, 1]` |
| `edge_metadata_eval.json` | Eval-edge metadata only |
| `edge_metadata_base_model.json` | base-model edge metadata |
| `edge_metadata_resource.json` | paper / codebase resource edge metadata |
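
The table gives array shapes but not the key names stored inside each `.npz` archive in `full/`, so it may be safer to list the keys before indexing. A minimal sketch, where `describe_npz` is a hypothetical helper and not part of the dataset:

```python
import numpy as np

def describe_npz(path):
    """Return {array_name: shape} for every array stored in an .npz file."""
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}

# Example call, assuming `path` points at a downloaded snapshot of the dataset:
# print(describe_npz(f"{path}/full/edges_eval.npz"))
```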
Each split directory (`transductive/` and `inductive/`) contains:
| file | description |
|---------------------------------|-------------|
| `node_embeddings_voyage.npy` | Voyage-3 embeddings, shape `(N, 1024)` |
| `node_embeddings_random.npy` | L2-normalised random embeddings, same shape |
| `split_info.json` | Split metadata (seed, counts, dates) |
| `node_split.json` (inductive) | Per-node train/test assignment |
| `train_split/` | Training subgraph (see below) |
| `test_split/` | Test subgraph (held-out eval edges) |
Each `{train,test}_split/` holds:
| file | description |
|-----------------------------------|-------------|
| `node_metadata.json` | Per-node `{type, name, downloads, info}` |
| `edge_metadata_normalized.json` | Normalized `(u,v) → metric: value` map |
| `edges.npz` | Message-passing edges, `edges` key, shape `(2, E)` |
| `pos_edges.npz` | Positive eval edges (model × dataset with metric) |
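
Before training on a split, it can be worth checking that no positive eval edge appears in both `train_split/` and `test_split/`. The sketch below works on `(2, E)` integer arrays such as those loaded from `pos_edges.npz`; the helper names are illustrative, and treating edges as undirected is an assumption.

```python
import numpy as np

def edge_set(edges):
    """Turn a (2, E) integer array into a set of (u, v) tuples."""
    return {(int(u), int(v)) for u, v in edges.T}

def leaked_edges(train_pos, test_pos):
    """Edges present in both splits, ignoring direction."""
    def undirected(pairs):
        return {tuple(sorted(pair)) for pair in pairs}
    return undirected(edge_set(train_pos)) & undirected(edge_set(test_pos))

# Toy arrays standing in for the two pos_edges.npz files.
train_pos = np.array([[0, 1, 2], [5, 6, 7]])
test_pos = np.array([[3, 6], [8, 1]])
print(leaked_edges(train_pos, test_pos))  # {(1, 6)}
```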
## Node types
- `model`: HuggingFace models (e.g., `sileod/deberta-v3-large-tasksource-nli`)
- `dataset`: HuggingFace datasets (e.g., `nyu-mll/multi_nli`)
- `paper`: referenced papers (arXiv IDs)
- `codebase`: linked repositories
## Edge types
- `model ↔ dataset` (eval): accuracy / F1 / BLEU / etc. (normalized to `[0, 1]`)
- `model ↔ paper`, `model ↔ codebase`, `dataset ↔ paper`, `dataset ↔ codebase`: resource links
- `model ↔ model`: base-model / fine-tune relations
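
Because each node carries a `type`, the edge types listed above can be recovered from the endpoint types. A small sketch on in-memory toy data; the `node_type` mapping here is hand-written, and how it would be derived from `node_metadata.json` depends on that file's exact layout, which is not specified here.

```python
from collections import Counter

def edge_type_counts(edges, node_type):
    """Count edges by the unordered pair of endpoint node types."""
    counts = Counter()
    for u, v in edges:
        pair = tuple(sorted((node_type[u], node_type[v])))
        counts[pair] += 1
    return counts

# Toy graph: two models, one dataset, one paper.
node_type = {0: "model", 1: "dataset", 2: "paper", 3: "model"}
toy_edges = [(0, 1), (3, 1), (0, 2), (3, 0)]
counts = edge_type_counts(toy_edges, node_type)
print(counts)
```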
## Usage
```python
import json

import numpy as np
from huggingface_hub import snapshot_download

# Repo ID as shown on this dataset page.
path = snapshot_download("lwaekfjlk/artifact-bench", repo_type="dataset")

# Split-level node embeddings, per-node metadata, and positive eval edges.
emb = np.load(f"{path}/transductive/node_embeddings_voyage.npy")
with open(f"{path}/transductive/train_split/node_metadata.json") as f:
    nm = json.load(f)
pe = np.load(f"{path}/transductive/train_split/pos_edges.npz")["edges"]
print(emb.shape, len(nm), pe.shape)
```
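
Once embeddings and positive edges are loaded, a training-free link-prediction baseline is to score each candidate edge by the cosine similarity of its endpoint embeddings. The sketch below is an illustration, not the benchmark's official protocol: the function names are hypothetical, negatives are sampled uniformly over all nodes without filtering out true edges, and embeddings are assumed to have non-zero norm.

```python
import numpy as np

def cosine_scores(emb, edges):
    """Cosine similarity between endpoint embeddings for a (2, E) edge array."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return np.sum(unit[edges[0]] * unit[edges[1]], axis=1)

def sample_negatives(pos_edges, num_nodes, rng):
    """Sample as many uniformly random node pairs as there are positives."""
    return rng.integers(0, num_nodes, size=pos_edges.shape)

def auc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count half)."""
    diff = pos_scores[:, None] - neg_scores[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Toy data standing in for the real embeddings and pos_edges array.
rng = np.random.default_rng(0)
toy_emb = rng.normal(size=(20, 8))
toy_pos = np.array([[0, 1, 2], [10, 11, 12]])
toy_neg = sample_negatives(toy_pos, num_nodes=20, rng=rng)
print(auc(cosine_scores(toy_emb, toy_pos), cosine_scores(toy_emb, toy_neg)))
```

The pairwise `auc` above uses memory proportional to positives × negatives, so for the full graph a library implementation would likely be a better fit.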
## Case study: NLI (`case_study_nli/`)
Frozen 576-cell evaluation grid used as the NLI case study in our paper:
48 NLI models × 12 NLI datasets. Each cell was produced by an LLM-coder
pipeline that emitted per-example predictions and a top-level accuracy.
### Layout
| path | description |
|---|---|
| `case_study_nli/raw_evals/<model>_<dataset>_accuracy/predictions.json` | Per-example `{idx, prediction, ground_truth}` |
| `case_study_nli/raw_evals/<model>_<dataset>_accuracy/results.json` | `{accuracy: float}` (plus `previous_accuracy` for 9 bug cells) |
| `case_study_nli/all_results_summary_fixed.json` | Cleaned aggregate: 576 rows, 9 bug fixes applied, masked flags for cells that cannot be scored 3-way |
| `case_study_nli/scripts/rebuild_nli_summary.py` | Raw → fixed aggregate |
| `case_study_nli/scripts/plot_nli_heatmap.py` | 45-model heatmap (3 models with degenerate cells excluded by `--min-cell 0.05`) |
| `case_study_nli/scripts/plot_nli_matrix_scree.py` | Double-centered SVD scree plot |
| `case_study_nli/figures/nli_results_heatmap.{png,pdf}` | Main heatmap |
| `case_study_nli/figures/nli_matrix_scree.{png,pdf}` | Scree plot |
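
For readers unfamiliar with the double-centered SVD behind the scree plot, the idea is to remove per-model and per-dataset mean effects from the accuracy matrix and then look at how the remaining variance is spread over singular components. The sketch below is not taken from `plot_nli_matrix_scree.py`; it assumes a complete models × datasets matrix with no masked cells, and the function names are illustrative.

```python
import numpy as np

def double_center(matrix):
    """Subtract row and column means, then add back the grand mean."""
    row_mean = matrix.mean(axis=1, keepdims=True)
    col_mean = matrix.mean(axis=0, keepdims=True)
    return matrix - row_mean - col_mean + matrix.mean()

def explained_variance(matrix):
    """Share of squared singular-value mass per component after centering."""
    singular_values = np.linalg.svd(double_center(matrix), compute_uv=False)
    return singular_values**2 / np.sum(singular_values**2)

# Toy matrix: additive row/column effects plus one rank-1 interaction term.
row_effect = np.array([0.1, 0.2, 0.3])
col_effect = np.array([0.0, 0.1, 0.2, 0.3])
interaction = 0.1 * np.outer([1.0, -1.0, 0.0], [1.0, -1.0, 1.0, -1.0])
toy = row_effect[:, None] + col_effect[None, :] + interaction
print(explained_variance(toy))
```

In the toy case the additive effects vanish after centering, so nearly all remaining variance sits in the first component.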
### Known issues in the raw evaluations
1. **9 bug-fix cells**: the top-level `accuracy` was overwritten with 0, while
`previous_accuracy` (> 0) holds the real value. The fixed aggregate uses
`previous_accuracy`.
2. **Binary-output models on 3-way datasets**: three zero-shot classifiers
only emit 2 labels; their MNLI / SNLI / ANLI / NLI_FEVER cells are
masked in the aggregate (not directly comparable to 3-way models).
3. **2 true failures**: `microsoft/deberta-v3-base` on `allenai/scitail`
and `araag2/MedNLI` produced degenerate predictions.
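
The first issue can be handled when reading a single `results.json`. The following is a sketch of that rule as described above, not the code in `rebuild_nli_summary.py`; `resolve_accuracy` is a hypothetical name, and the masking of binary-output models is left out because the flag names in the aggregate are not documented here.

```python
def resolve_accuracy(result):
    """Prefer `previous_accuracy` when the top-level accuracy was zeroed out."""
    accuracy = result["accuracy"]
    previous = result.get("previous_accuracy")
    if accuracy == 0 and previous is not None and previous > 0:
        return previous
    return accuracy

print(resolve_accuracy({"accuracy": 0.0, "previous_accuracy": 0.87}))  # 0.87
print(resolve_accuracy({"accuracy": 0.91}))  # 0.91
```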
### Reproducibility note
The per-cell evaluation scripts were not uniformly persisted to disk —
cells run in the January batch retained them, but the April re-runs (the
majority) executed inline via an agent and only wrote results back. We
therefore ship just the frozen outputs (`predictions.json` + `results.json`)
rather than an incomplete script set. The processing scripts in
`case_study_nli/scripts/` are sufficient to regenerate the aggregate and
figures from the per-cell outputs.
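
Since the per-example predictions are shipped, a cell's accuracy can be recomputed independently of its `results.json`. A minimal sketch, assuming `predictions.json` holds a list of `{idx, prediction, ground_truth}` records whose labels use the same encoding on both sides; the function name is illustrative.

```python
def accuracy_from_predictions(records):
    """Fraction of records whose prediction equals the ground truth."""
    if not records:
        raise ValueError("no predictions to score")
    correct = sum(r["prediction"] == r["ground_truth"] for r in records)
    return correct / len(records)

# Toy records in the documented per-example format.
toy_records = [
    {"idx": 0, "prediction": 0, "ground_truth": 0},
    {"idx": 1, "prediction": 2, "ground_truth": 1},
    {"idx": 2, "prediction": 1, "ground_truth": 1},
]
print(accuracy_from_predictions(toy_records))
```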
### Reproduce aggregate + figures
```bash
# Run from the repository root.
pip install datasets numpy matplotlib huggingface_hub
python case_study_nli/scripts/rebuild_nli_summary.py \
    --src case_study_nli/raw_evals \
    --out case_study_nli/all_results_summary_fixed.json
python case_study_nli/scripts/plot_nli_heatmap.py \
    --input case_study_nli/all_results_summary_fixed.json \
    --out-dir case_study_nli/figures
python case_study_nli/scripts/plot_nli_matrix_scree.py \
    --input case_study_nli/all_results_summary_fixed.json \
    --out-dir case_study_nli/figures
```
## Verification bench (`verification_bench/`)
Full agent-based eval reproductions: 263 (model, dataset, metric) cells
drawn from a stratified "hard" sample of the artifact graph. A
skill-based multi-agent system (driver: GPT-5.2, tool mode:
multiturn_metadatatool) attempts to reproduce each published accuracy
score by locating the dataset, loading the model, writing an eval script,
and reporting a metric.
### Layout
```
verification_bench/
└── skills_multiagent_gpt-5.2_metadatatool/
    └── <model>_<dataset>_<metric>/
        ├── metadata.json      # (model, dataset, metric) spec
        ├── run_eval.py        # agent-written evaluation script
        ├── predictions.json   # per-example predictions
        ├── results.json       # top-level metric value
        ├── run.log            # agent trajectory log
        └── results_full.json  # rich metric breakdown (4 cells only)
```
### Use
```python
import json, os

ROOT = "verification_bench/skills_multiagent_gpt-5.2_metadatatool"
for cell in os.listdir(ROOT):
    result_path = f"{ROOT}/{cell}/results.json"
    if not os.path.exists(result_path):
        continue  # failed cells may have no results.json
    with open(f"{ROOT}/{cell}/metadata.json") as f:
        meta = json.load(f)
    with open(result_path) as f:
        result = json.load(f)
    print(meta["model_id"], meta["dataset_id"], result)
```
### Notes
- 263 / 266 cell dirs contain a complete `results.json`; the remaining
3 failed with agent / runtime errors.
- Cell directories are named `<model>_<dataset>_<metric>` with `/`
replaced by `_` in HuggingFace IDs.
- This suite is the best-performing agent configuration we evaluated
(156 cells above accuracy 0.5, 97 above 0.8); scores are properly
normalised to `[0, 1]`.
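
The naming rule and the threshold counts in the notes above can be expressed in a few lines. This is a sketch with hypothetical helper names; the scores in the example are made up, and how a single score is read from each `results.json` is left open because its key names are not documented here.

```python
def cell_dir_name(model_id, dataset_id, metric):
    """Build `<model>_<dataset>_<metric>` with `/` replaced by `_`."""
    parts = (model_id, dataset_id, metric)
    return "_".join(part.replace("/", "_") for part in parts)

def count_above(scores, threshold):
    """Number of scores strictly above the threshold."""
    return sum(score > threshold for score in scores)

name = cell_dir_name("microsoft/deberta-v3-base", "allenai/scitail", "accuracy")
print(name)  # microsoft_deberta-v3-base_allenai_scitail_accuracy

made_up_scores = [0.95, 0.81, 0.62, 0.40]
print(count_above(made_up_scores, 0.5), count_above(made_up_scores, 0.8))  # 3 2
```

Because HuggingFace IDs can themselves contain underscores, the directory name cannot be split back into its parts reliably; reading `metadata.json` is the safer way to recover the (model, dataset, metric) triple.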
Provider: lwaekfjlk



