EQ-bench

Name: EQ-bench
Creator: maas
Published: 2025-12-02 20:22:50
License: No description provided

ModelScope community; updated 2025-12-02; indexed 2025-12-06

Download link:

https://modelscope.cn/datasets/cc7704/EQ-bench

Description:

# EQ-Bench

This is the EQ-Bench v2 English dataset; all credit goes to Samuel J. Paech.

---

## Overview

**Title:** `EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models`

**Paper:** https://arxiv.org/abs/2312.06281

**Homepage:** https://eqbench.com/

EQ-Bench is a benchmark for language models designed to assess emotional intelligence.

Why emotional intelligence? One reason is that it represents a subset of abilities that are important for the user experience and that aren't explicitly tested by other benchmarks. Another reason is that it's not trivial to improve scores by fine-tuning for the benchmark, which makes it harder to "game" the leaderboard.

EQ-Bench is a little different from traditional psychometric tests. It uses a specific question format in which the subject has to read a dialogue and then rate the intensity of possible emotional responses of one of the characters. Every question is interpretative and assesses the ability to predict the magnitude of the 4 presented emotions. The test is graded without the need for a judge (so there is no length bias). It's cheap to run (only 171 questions) and produces results that correlate strongly with human preference (Arena ELO) and multi-domain benchmarks like MMLU.

## Dataset Files

This directory contains the EQ-Bench validation dataset:

- **`main_validation.csv`**: The main validation set with 171 questions
  - Format: CSV with columns: `prompt`, `reference_answer`, `reference_answer_fullscale`
  - `prompt`: The full prompt text containing the dialogue and question
  - `reference_answer`: Reference answer in v1 normalized format (dict format, 4 emotions sum to 10)
  - `reference_answer_fullscale`: Reference answer in v2 full-scale format (dict format, absolute intensity values 0-10)

## Dataset Format

Each row in the CSV contains:

- **prompt**: A string containing the full prompt with dialogue and question
- **reference_answer**: A dictionary string (Python dict format) with normalized emotion scores:

  ```python
  {
      'emotion1': 'EmotionName',
      'emotion1_score': score,  # normalized, sum of 4 emotions = 10
      'emotion2': 'EmotionName',
      'emotion2_score': score,
      # ... emotion3, emotion4
  }
  ```

- **reference_answer_fullscale**: A dictionary string with full-scale emotion scores:

  ```python
  {
      'emotion1': 'EmotionName',
      'emotion1_score': score,  # absolute intensity 0-10
      # ... emotion2, emotion3, emotion4
  }
  ```

**Note:** The v2 full-scale format (`reference_answer_fullscale`) is recommended for scoring, as it uses the official EQ-Bench v2 scoring algorithm.

## Usage in EvalScope

This dataset is integrated with EvalScope's `eq_bench` benchmark adapter.
To use it:

### Basic Usage

```python
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='your-model-name',
    datasets=['eq_bench'],
    generation_config={
        'temperature': 0.01,  # EQ-Bench recommended temperature
        'max_tokens': 60,     # EQ-Bench recommended max tokens
    },
    limit=10,  # Optional: limit number of samples for testing
)

run_task(task_cfg=task_cfg)
```

### Using CLI

```bash
evalscope eval \
  --model your-model-name \
  --datasets eq_bench \
  --generation-config '{"temperature": 0.01, "max_tokens": 60}' \
  --limit 10
```

### Dataset Configuration

The benchmark adapter automatically:

- Loads data from `datasets/EQ-bench/main_validation.csv`
- Uses the `reference_answer_fullscale` field for v2 scoring (recommended)
- Falls back to `reference_answer` for v1 scoring if fullscale is not available
- Uses the official EQ-Bench scoring algorithm from the bundled `answer_validation.py`

### Evaluation Metrics

The benchmark uses the `eq_bench_score` metric, which:

- Uses the official EQ-Bench v2 full-scale scoring algorithm
- Returns scores in the range 0-100 (internally 0-10 scaled by 10)
- Uses sigmoid scaling for small differences (≤5) and linear scaling for large differences (>5)
- Includes an adjustment constant (0.7477) that makes random answers score 0

## Implementation Details

The EQ-Bench adapter (`evalscope.benchmarks.eq_bench.eq_bench_adapter`) uses the official scoring functions bundled in `evalscope.benchmarks.eq_bench.answer_validation`, which ensures 100% consistency with the official EQ-Bench implementation.

## Citation

```bibtex
@misc{paech2023eqbench,
  title={EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models},
  author={Samuel J. Paech},
  year={2023},
  eprint={2312.06281},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
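
For readers who want to inspect the CSV outside EvalScope, the column layout described under Dataset Format could be read roughly as follows. This is a minimal sketch, not the adapter's loader: `SAMPLE_CSV`, `parse_answer`, and `load_rows` are hypothetical names, and the single sample row (emotion names and scores) is invented purely for illustration. It assumes each answer cell is a Python-literal dict string, as the format description states.

```python
import ast
import csv
import io

# Hypothetical one-row stand-in for main_validation.csv (the real file has 171 rows).
# The emotion names and scores below are made up for illustration.
SAMPLE_CSV = '''prompt,reference_answer,reference_answer_fullscale
"Placeholder dialogue and question text","{'emotion1': 'Relief', 'emotion1_score': 4, 'emotion2': 'Anger', 'emotion2_score': 1, 'emotion3': 'Guilt', 'emotion3_score': 3, 'emotion4': 'Pride', 'emotion4_score': 2}","{'emotion1': 'Relief', 'emotion1_score': 8, 'emotion2': 'Anger', 'emotion2_score': 2, 'emotion3': 'Guilt', 'emotion3_score': 6, 'emotion4': 'Pride', 'emotion4_score': 4}"
'''


def parse_answer(cell: str) -> dict[str, float]:
    """Turn a dict-formatted cell into {emotion name: score}."""
    raw = ast.literal_eval(cell)  # parses Python literals only, never executes code
    return {raw[f"emotion{i}"]: float(raw[f"emotion{i}_score"]) for i in range(1, 5)}


def load_rows(handle) -> list[dict]:
    """Read every CSV row and parse both reference-answer columns."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "prompt": row["prompt"],
            "normalized": parse_answer(row["reference_answer"]),
            "fullscale": parse_answer(row["reference_answer_fullscale"]),
        })
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
first = rows[0]
print(first["fullscale"])
print(sum(first["normalized"].values()))  # the v1 normalized scores should sum to 10
```

To read the real file, pass `open(path, newline="", encoding="utf-8")` to `load_rows` instead of the in-memory sample; the encoding is an assumption.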

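
The `eq_bench_score` description (sigmoid scaling for per-emotion differences up to 5, linear scaling above 5, and the 0.7477 adjustment constant) can be sketched roughly as below. This is an illustration under stated assumptions, not the bundled `answer_validation.py`: the function names are hypothetical, and the sigmoid parameters (6.5, 1.2, 4) are not given in this document and are assumed here, so they should be checked against the official implementation before use.

```python
import math

ADJUST_CONST = 0.7477  # adjustment constant quoted in the metric description


def scale_difference(d: float) -> float:
    """Sigmoid scaling for gaps up to 5, linear scaling beyond that."""
    if d == 0:
        return 0.0
    if d <= 5:
        # ASSUMPTION: these sigmoid parameters are not stated in this document.
        return 6.5 / (1.0 + math.exp(-1.2 * (d - 4.0)))
    return d


def eq_bench_score_sketch(reference: dict[str, float],
                          predicted: dict[str, float]) -> float:
    """Score one question from full-scale reference and predicted intensities."""
    tally = sum(
        scale_difference(abs(predicted[name] - ref_score))
        for name, ref_score in reference.items()
    )
    # 10 minus the adjusted tally gives the internal score; multiplying by 10
    # gives the reported scale.
    return (10.0 - ADJUST_CONST * tally) * 10.0


# Hypothetical intensities, invented for illustration.
reference = {"Relief": 8.0, "Anger": 2.0, "Guilt": 6.0, "Pride": 4.0}
close_guess = {"Relief": 7.0, "Anger": 2.0, "Guilt": 6.0, "Pride": 5.0}
far_guess = {"Relief": 0.0, "Anger": 10.0, "Guilt": 0.0, "Pride": 10.0}

print(eq_bench_score_sketch(reference, reference))  # 100.0 for a perfect match
print(eq_bench_score_sketch(reference, close_guess))
print(eq_bench_score_sketch(reference, far_guess))
```

With the assumed parameters the sigmoid branch reaches about 5 at a difference of 5, so the two branches join smoothly. The sketch does not clamp its output, so very poor answers can fall below 0.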

Provider:

maas

Created:

2025-11-25
