
anon34957/HumanAgencyEval_Human_Annotations

Source: Hugging Face. Updated 2026-03-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/anon34957/HumanAgencyEval_Human_Annotations
Description:
---
license: mit
language:
- en
pretty_name: Human and model comparative evaluations dataset for HumanAgencyEval.
size_categories:
- 10K<n<100K
---

# Human annotations and LLM judge comparative Dataset

## Dataset Description

This dataset contains 60,000 evaluated AI assistant responses across 6 dimensions of behaviour relevant to human agency support, with both model-based and human annotations. Each example includes evaluations from 4 different frontier LLMs. For 900 of these examples (150 per dimension), we also include evaluations from human judges, together with their comments and reasoning.

## Evaluated Dimensions

The dataset evaluates AI assistant responses across 6 key behavioral dimensions:

1. **ask_clarifying_questions** - Whether the assistant appropriately seeks additional information when user queries are ambiguous or missing critical information
2. **avoid_value_manipulation** - Whether the assistant avoids manipulating or unduly influencing a user's values and beliefs
3. **correct_misinformation** - How well the assistant identifies and corrects false or misleading information
4. **defer_important_decisions** - Whether the assistant appropriately defers critical life decisions to humans rather than making recommendations
5. **encourage_learning** - How well the assistant promotes user learning, understanding, and intellectual growth
6. **maintain_social_boundaries** - Whether the assistant maintains appropriate professional boundaries and avoids endorsing inappropriate personal relationships

## Dataset Structure

### Data Fields

Each row contains:

- `id`: Unique identifier for each example
- `dim`: The dimension being evaluated
- `subject_model`: The model that generated the response (e.g. o4-mini-2025-04-16)
- `prompt`: The user's input query
- `subject_response`: The AI assistant's response to the prompt
- `model_annotations`: Evaluations from all 4 AI models:
  - `evaluator`: Model name (gemini/claude/o3/4.1)
  - `score`: Numeric score (0-10 scale)
  - `evaluator_deductions`: Letter codes for identified issues
  - `evaluator_response`: Detailed JSON evaluation with assessment and issues
- `human_annotations` (list): Human evaluations when available (900 examples have these):
  - `evaluator`: Indicator that the evaluator is a human
  - `annotator_id`: Unique identifier for the annotator
  - `score`: Numeric score
  - `evaluator_deductions`: Letter codes for identified issues
  - `task_comments`: Detailed qualitative feedback explaining the evaluation
  - `deduction_explanations`: Explanations for any deductions made
- `human_scores_agg`: Aggregated human scores
  - `mean`: Average score from human annotators
  - `count`: Number of human annotators for this example

## Example Structure

Every row contains evaluations from all 4 model evaluators, and a subset of rows (900 out of 60,000) also contains evaluations provided by human crowdworkers.

In the `model_annotations` column, a given row might show:

- GPT-4.1 giving a score of 10 with no issues
- Claude giving a score of 8 with issue "D"
- Gemini giving a score of 0 with issue "A"
- O3 giving a score of 10 with no issues

The separate `human_annotations` column would then show the numeric score given by each human annotator, the specific deductions they selected, and comments motivating their choice.

## Citation

[To be added]
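
## Usage Sketch

As a rough illustration of the row layout described under Data Fields, the Python sketch below summarizes one row: it collects the four model scores, computes their mean and spread, and reads the aggregated human score when one is present. Several things here are assumptions rather than facts taken from the dataset card: that `model_annotations` is a list of dictionaries, that `evaluator_deductions` is a list of letter codes, and that `human_scores_agg` is empty or `None` for rows without human annotations. The row itself is hypothetical, built from the scores in the Example Structure section with placeholder text fields, and `summarize_row` is an illustrative helper, not part of the dataset.

```python
from statistics import mean

# Hypothetical row mirroring the documented fields; the scores follow the
# "Example Structure" section. The id, dim choice, and "..." text fields
# are placeholders, not real data.
example_row = {
    "id": "example-0",
    "dim": "ask_clarifying_questions",
    "subject_model": "o4-mini-2025-04-16",
    "prompt": "...",
    "subject_response": "...",
    "model_annotations": [
        # Assumed shape: one dict per evaluator, deductions as a list of codes.
        {"evaluator": "4.1", "score": 10, "evaluator_deductions": []},
        {"evaluator": "claude", "score": 8, "evaluator_deductions": ["D"]},
        {"evaluator": "gemini", "score": 0, "evaluator_deductions": ["A"]},
        {"evaluator": "o3", "score": 10, "evaluator_deductions": []},
    ],
    "human_annotations": [],   # assumed empty when no human rated the row
    "human_scores_agg": None,  # assumed missing for the same reason
}


def summarize_row(row):
    """Return per-row model agreement statistics and the human mean, if any."""
    scores = [a["score"] for a in row["model_annotations"]]
    human = row.get("human_scores_agg") or {}
    return {
        "id": row["id"],
        "dim": row["dim"],
        "model_mean": mean(scores),
        # Gap between the most and least generous model judge.
        "model_spread": max(scores) - min(scores),
        "human_mean": human.get("mean"),
        "human_count": human.get("count", 0),
    }


summary = summarize_row(example_row)
print(summary)
```

A large `model_spread`, as in this row where the judges range from 0 to 10, is one simple way to flag examples on which the model evaluators disagree; the 900 rows with human annotations could be compared against `model_mean` in the same way.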
Provider:
anon34957