
lwaekfjlk/artifact-bench

Hugging Face · updated 2026-04-21 · indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/lwaekfjlk/artifact-bench
Description:
---
license: cc-by-4.0
language:
- en
tags:
- graph
- link-prediction
- benchmark
- model-dataset
size_categories:
- 10K<n<100K
---

# Artifact Graph

A heterogeneous graph of HuggingFace model/dataset/paper/codebase nodes with observed (model, dataset, performance-metric) evaluation edges, used to benchmark link prediction and attribute regression.

## Contents

| path | description |
|---|---|
| `full/` | Full unsplit graph: all nodes + all edges (by type) |
| `transductive/` | All nodes visible in both train and test; edges split |
| `inductive/` | Disjoint node partition: some nodes train-only, others test-only |

## Full graph (`full/`)

| file | description |
|---|---|
| `node_metadata.json` | Per-node `{type, name, downloads, info}` for all 14K nodes |
| `node_mappings.json` | Integer ID ↔ HuggingFace ID mapping |
| `node_embeddings_voyage.npy` | Voyage-3 embeddings, `(N, 1024)` |
| `node_embeddings_random.npy` | L2-normalised random embeddings |
| `edges.npz` | All edges combined, `(2, E)` |
| `edges_eval.npz` | model × dataset evaluation edges |
| `edges_base_model.npz` | model → base_model edges |
| `edges_resource.npz` | model/dataset → paper/codebase edges |
| `edge_metadata.json` | Raw (model, dataset, metric) edge records |
| `edge_metadata_normalized.json` | Eval edges with metrics normalised to `[0, 1]` |
| `edge_metadata_eval.json` | Eval-edge metadata only |
| `edge_metadata_base_model.json` | base-model edge metadata |
| `edge_metadata_resource.json` | paper / codebase resource edge metadata |

Each split directory (`transductive/`, `inductive/`) contains:

| file | description |
|---|---|
| `node_embeddings_voyage.npy` | Voyage-3 embeddings, shape `(N, 1024)` |
| `node_embeddings_random.npy` | L2-normalised random embeddings, same shape |
| `split_info.json` | Split metadata (seed, counts, dates) |
| `node_split.json` (inductive) | Per-node train/test assignment |
| `train_split/` | Training subgraph (see below) |
| `test_split/` | Test subgraph (held-out eval edges) |

Each `{train,test}_split/` holds:

| file | description |
|---|---|
| `node_metadata.json` | Per-node `{type, name, downloads, info}` |
| `edge_metadata_normalized.json` | Normalized `(u,v) → metric: value` map |
| `edges.npz` | Message-passing edges, `edges` key, shape `(2, E)` |
| `pos_edges.npz` | Positive eval edges (model × dataset with metric) |

## Node types

- `model`: HuggingFace models (e.g., `sileod/deberta-v3-large-tasksource-nli`)
- `dataset`: HuggingFace datasets (e.g., `nyu-mll/multi_nli`)
- `paper`: referenced papers (arXiv IDs)
- `codebase`: linked repositories

## Edge types

- `model ↔ dataset` (eval): accuracy / F1 / BLEU / etc. (normalized to `[0, 1]`)
- `model ↔ paper`, `model ↔ codebase`, `dataset ↔ paper`, `dataset ↔ codebase`: resource links
- `model ↔ model`: base-model / fine-tune relations

## Usage

```python
import json

import numpy as np
from huggingface_hub import snapshot_download

# Repo ID as written in this card; the hosting page lists the dataset as
# lwaekfjlk/artifact-bench, so adjust the ID if the download fails.
path = snapshot_download("lwaekfjlk/artifact-graph", repo_type="dataset")

emb = np.load(f"{path}/transductive/node_embeddings_voyage.npy")
with open(f"{path}/transductive/train_split/node_metadata.json") as f:
    nm = json.load(f)
pe = np.load(f"{path}/transductive/train_split/pos_edges.npz")["edges"]
print(emb.shape, len(nm), pe.shape)
```

## Case study: NLI (`case_study_nli/`)

Frozen 576-cell evaluation grid used as the NLI case study in our paper: 48 NLI models × 12 NLI datasets. Each cell was produced by an LLM-coder pipeline that emitted per-example predictions and a top-level accuracy.

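The 48 × 12 grid described above is naturally viewed as a model × dataset matrix. The following is a minimal sketch of that arrangement, not the authors' code. It assumes each record is a dict with the hypothetical keys `model`, `dataset`, and `accuracy`; the actual schema of the shipped JSON files is not documented here and may differ. Cells without a score are left as `None`.

```python
def to_grid(records):
    """Arrange per-cell accuracies into a models x datasets grid.

    `records` is a list of dicts with the hypothetical keys
    `model`, `dataset`, and `accuracy`. Missing cells stay None.
    """
    models = sorted({r["model"] for r in records})
    datasets = sorted({r["dataset"] for r in records})
    row = {m: i for i, m in enumerate(models)}
    col = {d: j for j, d in enumerate(datasets)}
    # One row per model, one column per dataset, all cells empty at first.
    grid = [[None] * len(datasets) for _ in models]
    for r in records:
        grid[row[r["model"]]][col[r["dataset"]]] = r["accuracy"]
    return models, datasets, grid


# Toy records for illustration only (not real results).
toy = [
    {"model": "m1", "dataset": "d1", "accuracy": 0.9},
    {"model": "m1", "dataset": "d2", "accuracy": 0.7},
    {"model": "m2", "dataset": "d1", "accuracy": 0.8},
]
models, datasets, grid = to_grid(toy)
print(len(models), len(datasets))
```

With the real data, a fully populated grid would have 48 rows and 12 columns, matching the 576 cells stated above.
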
### Layout

| path | description |
|---|---|
| `case_study_nli/raw_evals/<model>_<dataset>_accuracy/predictions.json` | Per-example `{idx, prediction, ground_truth}` |
| `case_study_nli/raw_evals/<model>_<dataset>_accuracy/results.json` | `{accuracy: float}` (plus `previous_accuracy` for 9 bug cells) |
| `case_study_nli/all_results_summary_fixed.json` | Cleaned aggregate: 576 rows, 9 bug fixes applied, masked flags for cells that cannot be scored 3-way |
| `case_study_nli/scripts/rebuild_nli_summary.py` | Raw → fixed aggregate |
| `case_study_nli/scripts/plot_nli_heatmap.py` | 45-model heatmap (3 models with degenerate cells excluded by `--min-cell 0.05`) |
| `case_study_nli/scripts/plot_nli_matrix_scree.py` | Double-centered SVD scree plot |
| `case_study_nli/figures/nli_results_heatmap.{png,pdf}` | Main heatmap |
| `case_study_nli/figures/nli_matrix_scree.{png,pdf}` | Scree plot |

### Known issues in the raw evaluations

1. **9 bug-fix cells**: top-level `accuracy=0` was overwritten but `previous_accuracy>0` holds the real value. The fixed aggregate uses `previous_accuracy`.
2. **Binary-output models on 3-way datasets**: three zero-shot classifiers only emit 2 labels; their MNLI / SNLI / ANLI / NLI_FEVER cells are masked in the aggregate (not directly comparable to 3-way models).
3. **2 true failures**: `microsoft/deberta-v3-base` on `allenai/scitail` and `araag2/MedNLI` produced degenerate predictions.

### Reproducibility note

The per-cell evaluation scripts were not uniformly persisted to disk: cells run in the January batch retained them, but the April re-runs (the majority) executed inline via an agent and only wrote results back. We therefore ship just the frozen outputs (`predictions.json` + `results.json`) rather than an incomplete script set. The processing scripts in `case_study_nli/scripts/` are sufficient to regenerate the aggregate and figures from the per-cell outputs.

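The per-cell files described above can be sanity-checked with a few lines of Python. The sketch below is illustrative and is not the shipped `rebuild_nli_summary.py`, whose exact logic may differ. It assumes `results.json` parses to a dict with an `accuracy` key and an optional `previous_accuracy` key, and that `predictions.json` parses to a list of `{idx, prediction, ground_truth}` dicts; the function names are hypothetical.

```python
def effective_accuracy(result):
    """Pick the accuracy to use for one cell's results.json contents.

    Mirrors the rule stated in the known issues: when the top-level
    accuracy is 0 and a positive `previous_accuracy` exists, use the latter.
    """
    acc = result["accuracy"]
    prev = result.get("previous_accuracy")
    if acc == 0 and prev is not None and prev > 0:
        return prev
    return acc


def accuracy_from_predictions(predictions):
    """Recompute accuracy as the fraction of exact label matches."""
    if not predictions:
        raise ValueError("no predictions to score")
    correct = sum(1 for p in predictions if p["prediction"] == p["ground_truth"])
    return correct / len(predictions)


# Toy inputs for illustration only (not real results).
bug_cell = {"accuracy": 0.0, "previous_accuracy": 0.87}
normal_cell = {"accuracy": 0.91}
toy_preds = [
    {"idx": 0, "prediction": 1, "ground_truth": 1},
    {"idx": 1, "prediction": 0, "ground_truth": 2},
]
print(effective_accuracy(bug_cell), effective_accuracy(normal_cell))
print(accuracy_from_predictions(toy_preds))
```

A cell with `accuracy=0` and no `previous_accuracy` is returned unchanged, which is the behaviour one would want for the true-failure cells listed above.
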
### Reproduce aggregate + figures

Run from the repository root; the scripts live under `case_study_nli/scripts/`, as listed in the layout table.

```bash
pip install datasets numpy matplotlib huggingface_hub

python case_study_nli/scripts/rebuild_nli_summary.py \
    --src case_study_nli/raw_evals \
    --out case_study_nli/all_results_summary_fixed.json

python case_study_nli/scripts/plot_nli_heatmap.py \
    --input case_study_nli/all_results_summary_fixed.json \
    --out-dir case_study_nli/figures

python case_study_nli/scripts/plot_nli_matrix_scree.py \
    --input case_study_nli/all_results_summary_fixed.json \
    --out-dir case_study_nli/figures
```

## Verification bench (`verification_bench/`)

Full agent-based eval reproductions: 263 (model, dataset, metric) cells drawn from a stratified "hard" sample of the artifact graph. A skill-based multi-agent system (driver: GPT-5.2, tool mode: `multiturn_metadatatool`) attempts to reproduce each published accuracy score by locating the dataset, loading the model, writing an eval script, and reporting a metric.

### Layout

```
verification_bench/
└── skills_multiagent_gpt-5.2_metadatatool/
    └── <model>_<dataset>_<metric>/
        ├── metadata.json       # (model, dataset, metric) spec
        ├── run_eval.py         # agent-written evaluation script
        ├── predictions.json    # per-example predictions
        ├── results.json        # top-level metric value
        ├── run.log             # agent trajectory log
        └── results_full.json   # rich metric breakdown (4 cells only)
```

### Use

```python
import json
import os

ROOT = "verification_bench/skills_multiagent_gpt-5.2_metadatatool"
for cell in sorted(os.listdir(ROOT)):
    results_path = f"{ROOT}/{cell}/results.json"
    if not os.path.exists(results_path):
        continue  # failed cells may lack a results.json (see Notes)
    with open(f"{ROOT}/{cell}/metadata.json") as f:
        meta = json.load(f)
    with open(results_path) as f:
        result = json.load(f)
    print(meta["model_id"], meta["dataset_id"], result)
```

### Notes

- 263 / 266 cell dirs contain a complete `results.json`; the remaining 3 failed with agent / runtime errors.
- Cell directories are named `<model>_<dataset>_<metric>` with `/` replaced by `_` in HuggingFace IDs.
- This suite is the best-performing agent configuration we evaluated (156 cells above accuracy 0.5, 97 above 0.8); scores are properly normalised to `[0, 1]`.
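
The directory naming rule in the notes above can be sketched as a small helper. The function name `cell_dir_name` is hypothetical, and the example IDs are borrowed from the node-type examples earlier in this card; that particular cell is not claimed to exist in the bench.

```python
def cell_dir_name(model_id, dataset_id, metric):
    """Build a `<model>_<dataset>_<metric>` directory name.

    Each `/` in the HuggingFace IDs is replaced by `_`, as the notes describe.
    """
    parts = [model_id.replace("/", "_"), dataset_id.replace("/", "_"), metric]
    return "_".join(parts)


name = cell_dir_name(
    "sileod/deberta-v3-large-tasksource-nli",
    "nyu-mll/multi_nli",
    "accuracy",
)
print(name)
```

The mapping is not reversible in general, because HuggingFace IDs can themselves contain underscores. Reading `metadata.json` is therefore the safer way to recover the original model and dataset IDs from a cell directory.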

Provider:
lwaekfjlk