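FlagEval | Model Evaluation Dataset | Cognitive Boundary Dataset

Source: GitHub. Updated 2023-06-01. Indexed 2025-02-07.

Model evaluation

Cognitive boundaries

Download link: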

https://github.com/flageval-baai/FlagEval

Overview:

The FlagEval dataset currently comprises 22 evaluation sets with 84,433 questions in total. It introduces a fine-grained evaluation framework organized along "ability-task-metric" dimensions, which is intended to reveal a model's cognitive boundaries in detail. The evaluation covers more than 30 abilities, 5 main tasks, and 4 key metrics, spanning over 600 sub-dimensions.
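
One way to picture an "ability-task-metric" breakdown is as a table keyed by those three dimensions. The sketch below is only an illustration of that idea: the ability, task, and metric names, the `RESULTS` records, and the `aggregate` helper are all hypothetical and are not taken from FlagEval.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-question results; every name and score here is
# illustrative and does not come from FlagEval.
RESULTS = [
    {"ability": "reasoning", "task": "multiple_choice", "metric": "accuracy", "score": 1.0},
    {"ability": "reasoning", "task": "multiple_choice", "metric": "accuracy", "score": 0.0},
    {"ability": "knowledge", "task": "multiple_choice", "metric": "accuracy", "score": 1.0},
    {"ability": "knowledge", "task": "open_qa", "metric": "accuracy", "score": 0.5},
]


def aggregate(results):
    """Average the scores inside each (ability, task, metric) cell."""
    cells = defaultdict(list)
    for row in results:
        key = (row["ability"], row["task"], row["metric"])
        cells[key].append(row["score"])
    return {key: mean(scores) for key, scores in cells.items()}


report = aggregate(RESULTS)
for (ability, task, metric), value in sorted(report.items()):
    print(f"{ability:10s} {task:16s} {metric:9s} {value:.2f}")
```

Keeping the three dimensions as a composite key means a weak cell (for example, one ability on one task) stays visible instead of being averaged away in a single overall score.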

Provided by:

BAAI et al.

Created:

2023-06-01

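Summary of Original Information

FlagEval Dataset Overview

1. General Introduction

Positioning: an open-source evaluation toolkit and an open platform for evaluating large models
Evaluation targets: foundation models, pre-training algorithms, and fine-tuning/compression algorithms
Evaluation scenarios: natural language processing (NLP), computer vision (CV), audio, and multimodal
Goal: to develop scientific, impartial, and clear benchmarks and methodology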

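2. Subproject Details

2.1 mCLIPEval

Function: an evaluation toolkit for vision-language models such as CLIP
Features:
- Supports datasets in 12 languages as well as monolingual (English/Chinese) datasets
- Evaluation tasks: zero-shot classification, retrieval, and composition
- Works with a range of pre-trained models (FlagAI, OpenCLIP, Chinese CLIP, and others)
- Supports preparing data from multiple sources (torchvision, huggingface, kaggle)
- Provides visualized evaluation results (leaderboards, model comparisons)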
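
To make the zero-shot classification task concrete, here is a minimal sketch of how a CLIP-style model is usually scored: each image embedding is compared with one text embedding per class, and the most similar class is the prediction. This does not use mCLIPEval's API. The embeddings are toy 3-dimensional vectors and `zero_shot_classify` is a hypothetical helper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def zero_shot_classify(image_embedding, class_embeddings):
    """Return the class whose text embedding is closest to the image embedding."""
    return max(
        class_embeddings,
        key=lambda name: cosine(image_embedding, class_embeddings[name]),
    )


# Toy embeddings (assumption). A real model would produce these from an
# image encoder and from text prompts such as "a photo of a cat".
CLASS_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
IMAGES = [
    ([0.8, 0.2, 0.1], "cat"),
    ([0.2, 0.7, 0.1], "dog"),
    ([0.1, 0.0, 1.0], "car"),
]

predictions = [zero_shot_classify(emb, CLASS_EMBEDDINGS) for emb, _ in IMAGES]
accuracy = sum(p == label for p, (_, label) in zip(predictions, IMAGES)) / len(IMAGES)
print(predictions, accuracy)
```

No classifier is trained on the target labels, which is what makes the task "zero-shot": changing the label set only requires encoding new text prompts.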

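2.2 ImageEval-prompt

Function: a fine-grained prompt set for evaluating text-to-image (T2I) models
Features:
- Contains 1,624 English prompts and 339 Chinese prompts
- Annotated with a "double-blind annotation plus third-party arbitration" method
- Evaluation along three dimensions:
  - Entity dimension (object, state, color, quantity, position)
  - Style dimension (painting style, cultural style)
  - Detail dimension (hands, facial features, gender, logical knowledge)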
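
The "double-blind annotation plus third-party arbitration" method can be sketched as a small merge step: two annotators label each prompt independently, agreements are accepted, and disagreements go to an arbitrator. The prompt IDs, labels, and the `resolve_labels` and `third_party` functions below are hypothetical. The source does not describe the actual tooling.

```python
def resolve_labels(annotator_a, annotator_b, arbitrate):
    """Merge two independent annotations, arbitrating only the disagreements."""
    final, disputed = {}, []
    for item_id, label_a in annotator_a.items():
        label_b = annotator_b[item_id]
        if label_a == label_b:
            final[item_id] = label_a
        else:
            disputed.append(item_id)
            final[item_id] = arbitrate(item_id, label_a, label_b)
    return final, disputed


# Hypothetical prompt IDs and dimension labels from two annotators
# who do not see each other's work.
ANNOTATOR_A = {"p1": "entity", "p2": "style", "p3": "detail"}
ANNOTATOR_B = {"p1": "entity", "p2": "detail", "p3": "detail"}


def third_party(item_id, label_a, label_b):
    """Stand-in for a human arbitrator's decision (assumption)."""
    decisions = {"p2": "style"}
    return decisions[item_id]


final, disputed = resolve_labels(ANNOTATOR_A, ANNOTATOR_B, third_party)
print(final)     # merged labels
print(disputed)  # items that needed arbitration
```

Tracking the disputed items separately also gives a simple agreement rate, which is a common sanity check on annotation quality.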

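2.3 C-SEM

Function: a semantic understanding evaluation system for large models (v1.0)
Evaluation items:
- Lexical Level Semantic Relationship Classification (LLSRC)
- Sentence Level Semantic Relationship Classification (SLSRC)
- Sentence Level Polysemous Words Classification (SLPWC)
- Sentence Level Rhetoric Figure Classification (SLRFC)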

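3. License Information

Main license: Apache 2.0
Special licenses:
- CLIP_benchmark is under the MIT license
- The ImageNet1k dataset is under the huggingface datasets and ImageNet licenses

4. Contact

Feedback: GitHub Issues or flageval@baai.ac.cn
Hiring and collaboration: positions related to foundation model evaluation
Contributions: new tasks, datasets, and tools are welcome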

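AI-Collected Summary

Dataset Introduction

Construction

FlagEval was built to meet the evaluation needs of large multimodal models and covers four core scenarios: natural language processing, computer vision, audio, and multimodal. Its construction integrates open-source tools and platforms with a variety of pre-trained models and algorithms, with the aim of making evaluation scientific and comprehensive. Data come from several sources, including torchvision, huggingface, and kaggle, which broadens coverage. In addition, prompts are annotated through double-blind annotation with third-party arbitration to improve data quality and reliability.

Characteristics

FlagEval's distinguishing feature is multi-dimensional, multilingual evaluation. It supports multilingual (12 languages) and monolingual (Chinese and English) datasets and covers zero-shot classification, retrieval, and composition tasks. Prompt annotations are organized into entity, style, and detail dimensions, each subdivided into sub-dimensions such as object, state, and color, which gives fine-grained evaluation criteria. FlagEval also includes the C-SEM evaluation system, which examines a model's semantic understanding at the word and sentence levels and provides comparative data for research.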
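
For the retrieval task mentioned above, a commonly used score is Recall@k: the fraction of queries whose correct item appears among the top k ranked results. The source does not say which retrieval metric FlagEval reports, so the sketch below is only an illustration. The rankings are toy data and the function names are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant item is within the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(queries, k):
    """Average Recall@k over (ranking, relevant_id) pairs."""
    return sum(recall_at_k(ranking, rel, k) for ranking, rel in queries) / len(queries)


# Toy text-to-image rankings (assumption): each query has one correct image.
QUERIES = [
    (["img3", "img1", "img2"], "img1"),
    (["img2", "img3", "img1"], "img2"),
    (["img1", "img2", "img3"], "img3"),
]

r1 = mean_recall_at_k(QUERIES, 1)
r2 = mean_recall_at_k(QUERIES, 2)
r3 = mean_recall_at_k(QUERIES, 3)
print(f"R@1={r1:.3f} R@2={r2:.3f} R@3={r3:.3f}")
```

Recall@k can only stay the same or rise as k grows, so reporting it at several values of k shows how far down the ranking the correct items tend to sit.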

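Usage

Using FlagEval requires PyTorch ≥ 1.8.0 and Python ≥ 3.8, with CUDA and NCCL installed for GPU evaluation. After cloning the GitHub repository and installing the dependencies, users can start evaluation tasks. FlagEval provides documentation and example code, and users pick the subproject that fits their need: mCLIPEval for evaluating vision-language models, ImageEval-prompt for fine-grained evaluation of text-to-image models. Results can be visualized as leaderboard charts or tables for detailed comparison of model performance.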
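
The version requirements above can be checked before starting an evaluation. The following is a minimal sketch, not part of FlagEval: the thresholds (Python 3.8, PyTorch 1.8.0) come from the text, while `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helper names. NCCL is not checked here.

```python
import re
import sys


def parse_version(text):
    """Extract leading numeric parts, e.g. '1.13.1+cu117' -> (1, 13, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups(default="0"))


def meets_minimum(version, minimum):
    """True if `version` is at least `minimum` (compared numerically)."""
    return parse_version(version) >= parse_version(minimum)


def check_environment():
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if sys.version_info[:2] < (3, 8):
        problems.append("Python >= 3.8 is required")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
    else:
        if not meets_minimum(torch.__version__, "1.8.0"):
            problems.append(f"PyTorch >= 1.8.0 is required, found {torch.__version__}")
        if not torch.cuda.is_available():
            problems.append("CUDA is not available, so GPU evaluation will not run")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Comparing parsed tuples rather than raw strings avoids the usual pitfall where "1.10.0" sorts before "1.8.0" alphabetically.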

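Background and Challenges

Background Overview

FlagEval is an open-source evaluation toolkit and open platform developed by the Beijing Academy of Artificial Intelligence (BAAI). It aims to provide scientific, impartial, and clear benchmarks, methods, and tools for evaluating large models. The platform covers key evaluation scenarios, including natural language processing (NLP), computer vision (CV), audio, and multimodal, and includes a wide range of downstream tasks. Its core research questions are how to comprehensively evaluate the effectiveness of foundation models and training algorithms, and how to use AI techniques to make subjective evaluation more objective and efficient. Since its release, FlagEval has contributed to the development of large-model evaluation and has attracted attention and participation from many researchers.

Current Challenges

FlagEval faces challenges in two main areas. First, as a domain problem, designing evaluation benchmarks that fully cover multimodal tasks, so that model performance is measured accurately across scenarios, is complex. Second, in construction, ensuring the diversity and representativeness of evaluation data is difficult, especially across languages and cultures, where data collection and annotation are demanding. In addition, using AI techniques to improve the objectivity and efficiency of subjective evaluation remains a direction the FlagEval team needs to keep exploring.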

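Common Scenarios

Typical Use Cases

As an open-source evaluation toolkit for large models, FlagEval is used for model evaluation in natural language processing (NLP), computer vision (CV), audio, and multimodal domains. Its core function is the evaluation of foundation models, pre-training algorithms, and fine-tuning/compression algorithms. By providing a wide range of downstream task datasets and evaluation scenarios, it gives researchers a scientific, impartial, and transparent platform for understanding a model's performance and limitations.

Academic Problems Addressed

FlagEval addresses several key academic problems in large-model evaluation, particularly performance evaluation in multilingual and multimodal settings. Through subprojects such as mCLIPEval, it supports zero-shot classification, retrieval, and composition tasks, which allows model performance on cross-lingual and cross-modal tasks to be measured. The C-SEM project builds semantic understanding evaluation data at multiple levels and difficulties, addressing a gap in how the semantic understanding of existing large models is assessed and giving a basis for model optimization.

Derived Work

FlagEval's release has spurred related research, particularly in multimodal model evaluation. For example, work based on mCLIPEval has supported the optimization of multilingual CLIP models, and C-SEM offers a new benchmark for evaluating semantic understanding. FlagEval's open-source nature has also drawn developers to contribute new evaluation tasks and datasets, enriching its ecosystem.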

The content above was collected and summarized by AI.
