ibm-research/Auto-BenchmarkCard
Source: Hugging Face. Updated 2025-12-16. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/ibm-research/Auto-BenchmarkCard
Description:
Auto-BenchmarkCard is a dataset of validated descriptions of AI evaluation benchmarks. The descriptions are generated through an LLM-assisted, human-reviewed process and provide structured metadata about each benchmark's purpose, methodology, data sources, risks, and limitations. The dataset addresses the problem of incomplete or inconsistent benchmark documentation by offering evaluated, structured descriptions. It draws on information extracted from several sources, including the Unitxt catalogue, Hugging Face repositories, and academic papers, and has been evaluated for factual accuracy. The dataset is curated by IBM Research, is primarily in English, and is licensed under the Community Data License Agreement – Permissive, Version 2.0. Researchers and practitioners can use it to compare benchmark characteristics, build recommendation systems, conduct systematic reviews, and understand benchmark risks and limitations. The dataset is created with an automated three-phase workflow: extraction, composing, and validation. Its limitations include potential LLM hallucinations, dependence on the quality of the source data, and the limits of the validation step.
Provider:
ibm-research



