hlido-eu/agent-benchmark

Name: hlido-eu/agent-benchmark
Creator: hlido-eu
Published: 2026-04-11 06:08:17
License: No description available

Hugging Face · Updated 2026-04-11 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/hlido-eu/agent-benchmark

Resource description:

---
language:
- en
license: cc-by-nc-4.0
tags:
- ai-agents
- benchmark
- evaluation
- agent-evaluation
- llm-agents
- trustworthy-ai
- product-review
- c2pa
pretty_name: Hlido AI Agent Benchmark
size_categories:
- n<1K
task_categories:
- other
---

# Hlido AI Agent Benchmark

**Independent, cryptographically-attested evaluations of AI agents and products.**

Hlido reviews AI agents the way Rotten Tomatoes reviews films: independent verdicts with proof, one clear score, and no sponsored rankings. This dataset is the machine-readable output of those reviews.

Every run is:

- Executed autonomously by the Hlido testing pipeline
- Cryptographically signed using C2PA (Coalition for Content Provenance and Authenticity)
- Scored using Hlido's proprietary two-layer evaluation model

**Website:** [hlido.eu](https://hlido.eu)

**Methodology:** [hlido.eu/methodology](https://hlido.eu/methodology/)

**Updated:** Daily (automated pipeline)

> © 2026 Hlido. The Laddoo Score, scoring methodology, tier system, and evaluation framework are proprietary. See licence and terms below.

---

## What's in this dataset

`run-log.json`: the full review ledger. Each entry is one Hlido pipeline run.

### Schema

| Field | Type | Description |
|---|---|---|
| `runId` | string | Unique run identifier |
| `domain` | string | Domain tested (e.g. `getbaton.dev`) |
| `url` | string | Entry URL used for the review |
| `startedAt` | ISO 8601 | When the pipeline run began |
| `finishedAt` | ISO 8601 | When the pipeline run completed |
| `score` | number \| null | Engine surface score (0–100). Null if run incomplete. |
| `tier` | string \| null | Engine tier label |
| `laddooScore` | number \| null | Hlido Laddoo Score (0–100), the final composite score |
| `laddooTier` | string \| null | VITAL / STEADY / FADING / FLATLINE |
| `pagesVisited` | array | Each page crawled: url, type, blocked |
| `blockersHit` | array | Blockers encountered: type, action taken, confidence |
| `claimsTested` | number | Number of marketing claims evaluated |
| `claimsVerified` | number | Claims that held up under testing |
| `errors` | number | Pipeline errors during run |
| `c2paSigned` | boolean | Whether artifacts were C2PA-signed |
| `t1Confidence` | string | High / Medium / Low: proportion of product surface testable |
| `testTier` | string | Tier 1 UI Auto / Tier 2 Guided Manual / Tier 3 Custom |
| `loggedAt` | ISO 8601 | When the record was written |
| `note` | string | Notes on blockers, anomalies, or methodology context |

---

## Scoring Overview

Hlido uses a proprietary two-layer scoring model. Full methodology is published at [hlido.eu/methodology](https://hlido.eu/methodology/).

### Layer 1: Engine Score

Automated measurement of the public web surface across six dimensions: product clarity, feature depth, trust signals, claim verification, user experience, and pricing transparency. The engine score reflects what an independent observer can verify from the public surface without login or payment.

### Layer 2: Hlido Laddoo Score

Hlido's independent editorial assessment of the product across four dimensions:

- **Strategic Alpha**: Does it solve a real problem in a differentiated way?
- **Craft & Soul**: Is the overall product experience excellent?
- **Execution Grit**: Does the product deliver on its core promise?
- **Value Signal**: Is pricing fair and transparent relative to its category?

The exact weights and blending formula are proprietary and not disclosed in this dataset.

**Laddoo tiers:** VITAL (90–100) · STEADY (70–89) · FADING (40–69) · FLATLINE (0–39)

### Login Walls and Tier Confidence

A product behind a login wall is not automatically scored lower on product quality. The engine score reflects what could be verified publicly; the editorial Laddoo Score assesses the product on its own merits. All reviews carry a confidence label (T1 High / Medium / Low) and a note on how much of the surface was testable.

---

## Evidence & Attestation

Each reviewed agent has a corresponding proof bundle (stored privately at Hlido):

- Screen recording of the review session (MP4)
- Timestamped screenshots of key moments (PNG)
- `attestation.json`: structured metadata covering domain, claims tested, blockers, scores
- All artifacts C2PA-signed (PS256, DigiCert timestamp authority)

C2PA signing means the video and screenshots carry a cryptographically verifiable chain of custody: provably captured by Hlido, provably unmodified. Proof bundles are available to reviewed agents on request.

---

## How to use this dataset

```python
import json, urllib.request

url = "https://huggingface.co/datasets/hlido-eu/agent-benchmark/resolve/main/run-log.json"
with urllib.request.urlopen(url) as r:
    data = json.load(r)

runs = data["runs"]

# Filter to scored runs only
scored = [r for r in runs if r.get("laddooScore") is not None]

# Sort by Laddoo Score
top = sorted(scored, key=lambda x: x["laddooScore"], reverse=True)
for r in top[:10]:
    print(f"{r['domain']:30} {r['laddooScore']:3}/100 {r.get('laddooTier','')}")
```

---

## Licence & Terms

This dataset is released under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).

You may:

- Use and share the data for **non-commercial research, journalism, and personal projects** with attribution
- Cite Hlido in academic papers and public analysis

You may **not**:

- Use this data to build a competing AI agent evaluation or scoring product
- Reproduce, replicate, or adapt the Hlido scoring methodology, Laddoo Score formula, or tier system in any commercial product
- Represent Hlido scores as your own without clear attribution
- Train commercial models on this data without written permission

For commercial licensing, API access, or bulk data: [hlido.eu](https://hlido.eu)

**Attribution:** "Hlido AI Agent Benchmark, hlido.eu"

---

## Citation

```bibtex
@misc{hlido2026benchmark,
  title  = {Hlido AI Agent Benchmark},
  author = {Hlido},
  year   = {2026},
  url    = {https://huggingface.co/datasets/hlido-eu/agent-benchmark},
  note   = {Independent, cryptographically-attested AI agent evaluations. Laddoo Score and methodology are proprietary. Updated daily.}
}
```

---

## Contact

[hlido.eu](https://hlido.eu): submit an agent for review, request a T2 deep evaluation, or enquire about commercial data licensing.
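The published tier bands (VITAL 90–100, STEADY 70–89, FADING 40–69, FLATLINE 0–39) can be turned into a small lookup for sanity-checking records. This is a minimal sketch, not Hlido's own logic: the names `laddoo_tier` and `tier_mismatches` are hypothetical, and because the README does not say how fractional scores between bands (for example 89.5) are handled, the sketch assumes each band starts at its listed lower bound.

```python
from typing import List, Optional

# Lower bounds taken from the tier bands listed in the README.
# Assumption: a score belongs to the highest band whose floor it reaches.
TIER_FLOORS = [(90, "VITAL"), (70, "STEADY"), (40, "FADING"), (0, "FLATLINE")]


def laddoo_tier(score: Optional[float]) -> Optional[str]:
    """Map a Laddoo Score to a tier label; None (unscored run) stays None."""
    if score is None:
        return None
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for floor, label in TIER_FLOORS:
        if score >= floor:
            return label
    return None  # not reachable for in-range scores


def tier_mismatches(runs: List[dict]) -> List[Optional[str]]:
    """Return the runId of each scored run whose stored laddooTier
    differs from the band derived from its laddooScore."""
    return [
        run.get("runId")
        for run in runs
        if run.get("laddooScore") is not None
        and run.get("laddooTier") != laddoo_tier(run["laddooScore"])
    ]
```

Calling `tier_mismatches(runs)` on the loaded `runs` list returns an empty list when every stored tier agrees with the published bands under the assumption above.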

Provider:

hlido-eu

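Collected summary

Dataset introduction

Construction

In AI agent evaluation, the Hlido AI Agent Benchmark is built around automation and verifiability. The dataset is generated by an autonomous testing pipeline, and each record corresponds to one complete agent evaluation run. Evaluation artifacts are cryptographically signed under the C2PA standard so that their origin can be verified and tampering detected. A proprietary two-layer scoring model, consisting of the engine surface score and the composite Laddoo Score, quantifies each product's public surface and core value. The pipeline updates daily and produces a structured run log, giving researchers a dynamic, auditable data foundation.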
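Because each record is one pipeline run with `startedAt` and `finishedAt` timestamps, run duration is easy to derive. This is a minimal sketch under stated assumptions: the helper names `parse_iso` and `run_duration_seconds` are hypothetical, the README only says the fields are ISO 8601 (so the trailing-`Z` handling is a guess at the likely format), and both timestamps in a record are assumed to share the same time zone style.

```python
from datetime import datetime
from typing import Optional


def parse_iso(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as UTC.

    The replacement keeps this working on Python versions whose
    datetime.fromisoformat does not accept the 'Z' suffix.
    """
    if ts.endswith("Z"):
        ts = ts[:-1] + "+00:00"
    return datetime.fromisoformat(ts)


def run_duration_seconds(run: dict) -> Optional[float]:
    """Seconds between startedAt and finishedAt, or None if either is missing."""
    started, finished = run.get("startedAt"), run.get("finishedAt")
    if not started or not finished:
        return None
    return (parse_iso(finished) - parse_iso(started)).total_seconds()


# Illustrative record with made-up values, not taken from the dataset.
example = {"startedAt": "2026-04-11T06:00:00Z", "finishedAt": "2026-04-11T06:02:30Z"}
print(run_duration_seconds(example))
```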

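Features

The dataset's defining features are independence and provability. It is not just a collection of performance metrics but a set of cryptographically attested evaluation records, each backed by a verifiable chain of evidence. It uses a two-layer scoring system that covers both automated measurement of a product's public surface and the editorial team's assessment of strategy, craft, execution, and value signal. Entries also carry rich metadata, such as pages visited, blockers encountered, test confidence, and methodology notes. This gives transparent context for analyzing the boundary conditions and limitations of each evaluation, and makes the dataset a distinctive reference for trustworthy AI evaluation.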
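The contextual metadata described above can be summarized with a couple of counters. This is a rough sketch, assuming each `blockersHit` entry is an object with a `type` key; the schema lists "type, action taken, confidence" but does not give exact key names, so that key and the function name `summarize_context` are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def summarize_context(runs: List[dict]) -> Tuple[Counter, Counter]:
    """Count runs per t1Confidence label and blockers per blocker type."""
    confidence = Counter(run.get("t1Confidence") or "unknown" for run in runs)
    blockers = Counter(
        blocker.get("type", "unknown")  # "type" key name is an assumption
        for run in runs
        for blocker in (run.get("blockersHit") or [])
        if isinstance(blocker, dict)
    )
    return confidence, blockers


# Illustrative records with made-up values, not taken from the dataset.
sample_runs = [
    {"t1Confidence": "High", "blockersHit": []},
    {"t1Confidence": "Low", "blockersHit": [{"type": "login_wall"}]},
]
confidence_counts, blocker_counts = summarize_context(sample_runs)
print(confidence_counts.most_common(), blocker_counts.most_common())
```

Reading the two counters side by side shows how often low-confidence runs coincide with particular blocker types, which is the kind of limitation analysis the metadata is meant to support.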

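Usage

Researchers can download the dataset's core JSON file and analyze it in any programming language. A typical workflow loads the data, filters to scored runs, and sorts by Laddoo Score to identify the highest-rated AI agent products. The dataset is suitable for non-commercial academic research, journalism, and personal analysis projects. Users must comply with the CC BY-NC 4.0 license, attribute the source clearly, and not use the data to build competing evaluation products or to train commercial models. Commercial use requires contacting Hlido for a license.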
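Beyond ranking by Laddoo Score, the same loaded records support other simple analyses. As one illustration, the sketch below derives a claim verification rate from the `claimsTested` and `claimsVerified` fields. The function names are hypothetical, and the ratio is an ad hoc metric for exploration, not part of Hlido's scoring model.

```python
from typing import List, Optional, Tuple


def claim_verification_rate(run: dict) -> Optional[float]:
    """Share of tested claims that were verified, or None if none were tested."""
    tested = run.get("claimsTested") or 0
    verified = run.get("claimsVerified") or 0
    if tested <= 0:
        return None
    return verified / tested


def rank_by_verification(runs: List[dict]) -> List[Tuple[str, float]]:
    """Return (domain, rate) pairs, highest rate first, skipping runs
    in which no claims were tested."""
    rated = []
    for run in runs:
        rate = claim_verification_rate(run)
        if rate is not None:
            rated.append((run.get("domain", ""), rate))
    return sorted(rated, key=lambda pair: pair[1], reverse=True)
```

Passing the `runs` list loaded from `run-log.json` to `rank_by_verification` gives a quick view of which reviewed domains had the largest share of marketing claims hold up under testing.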

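Background and challenges

Background overview

As AI agent technology develops rapidly, objective and trustworthy evaluation of agent systems has become a shared concern of academia and industry. The Hlido AI Agent Benchmark dataset was created by Hlido in 2026 to measure the performance and reliability of AI agent products through an independent, cryptographically attested evaluation framework. Its core research question is how to build a transparent, verifiable evaluation system that addresses the current lack of standardized benchmarks for AI agents. It combines an automated testing pipeline and a two-layer scoring model with content provenance attestation, offering a new methodological basis for assessing agent product quality and contributing to the development of trustworthy AI.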

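Current challenges

The dataset targets a central challenge in AI agent evaluation: designing an evaluation standard that reflects agent capability comprehensively while remaining highly repeatable and robust to interference in complex, changing interactive environments. Building it involves several technical difficulties. The automated testing pipeline has to cope with changing network conditions, heterogeneous interfaces, and obstacles such as login walls in order to keep coverage and consistency. Integrating cryptographic signing and an evidence chain requires the evaluation process to be traceable and tamper-evident, which adds engineering complexity. Balancing the weight of automated scoring against human editorial assessment, so that the composite score is both objective and insightful, is another key design challenge.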

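Common scenarios

Classic use cases

In AI agent evaluation, the Hlido AI Agent Benchmark dataset gives researchers a standardized framework for measuring the performance and trustworthiness of AI agents. Its automated testing pipeline scans each agent's public surface along several dimensions, including product clarity, feature depth, and trust signals, to produce reproducible evaluation results. Classic use cases include side-by-side comparison of different AI agents on real tasks and providing objective, verifiable performance baselines for academic research, which supports more transparent and standardized AI agent technology.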

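Academic problems addressed

The dataset addresses the lack of independent, verifiable benchmarks in AI agent evaluation. Traditional evaluations often rely on subjective judgment or restricted test environments, which makes fairness and reproducibility hard to guarantee. Hlido uses C2PA cryptographic signing to protect the integrity and provenance of evaluation data, and its two-layer scoring model combines automated measurement with independent editorial assessment. This gives the research community an evaluation approach that is both quantitative and qualitative, improves the rigor of AI agent evaluation, and provides a data foundation for trustworthy AI research.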

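Derived related work

Work derived from this dataset focuses mainly on strengthening evaluation methodology and extending its scope of application. For example, some research has drawn on its cryptographic attestation framework to develop new automated testing toolchains that improve evaluation efficiency for multimodal AI agents. Other work has used its published scoring dimensions to build fine-grained performance prediction models for early detection of potential defects in agent systems. The dataset's evaluation logic has also been applied to vertical domains, such as dedicated evaluations of financial advisory and medical diagnosis agents, moving evaluation standards toward more specialized, scenario-specific forms.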

The content above was collected and summarized by 遇见数据集.
