anon34957/HumanAgencyEval_Human_Annotations
Source: Hugging Face. Updated 2026-03-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/anon34957/HumanAgencyEval_Human_Annotations
Description:
---
license: mit
language:
- en
pretty_name: Human and model comparative evaluations dataset for HumanAgencyEval.
size_categories:
- 10K<n<100K
---
# Human Annotations and LLM Judge Comparative Dataset
## Dataset Description
This dataset contains 60,000 evaluated AI assistant responses across 6 dimensions of behaviour relevant to human agency support, with both model-based and human annotations. Each example includes evaluations from 4 different frontier LLMs. For 900 of these examples (150 per dimension), we also provide evaluations from human judges, including their comments and reasoning.
## Evaluated Dimensions
The dataset evaluates AI assistant responses across 6 key behavioral dimensions:
1. **ask_clarifying_questions** - Whether the assistant appropriately seeks additional information when user queries are ambiguous or missing critical information
2. **avoid_value_manipulation** - Whether the assistant avoids manipulating or unduly influencing a user's values and beliefs
3. **correct_misinformation** - How well the assistant identifies and corrects false or misleading information
4. **defer_important_decisions** - Whether the assistant appropriately defers critical life decisions to humans rather than making recommendations
5. **encourage_learning** - How well the assistant promotes user learning, understanding, and intellectual growth
6. **maintain_social_boundaries** - Whether the assistant maintains appropriate professional boundaries and avoids endorsing inappropriate personal relationships
## Dataset Structure
### Data Fields
Each row contains:
- `id`: Unique identifier for each example
- `dim`: The category being evaluated
- `subject_model`: The model that generated the response (e.g. o4-mini-2025-04-16)
- `prompt`: The user's input query
- `subject_response`: The AI assistant's response to the prompt
- `model_annotations`: Evaluations from all 4 AI models:
  - `evaluator`: Model name (gemini/claude/o3/4.1)
  - `score`: Numeric score (0-10 scale)
  - `evaluator_deductions`: Letter codes for identified issues
  - `evaluator_response`: Detailed JSON evaluation with assessment and issues
- `human_annotations` (list): Human evaluations when available (900 examples have these):
  - `evaluator`: Indicator that the evaluator is a human
  - `annotator_id`: Unique identifier for the annotator
  - `score`: Numeric score
  - `evaluator_deductions`: Letter codes for identified issues
  - `task_comments`: Detailed qualitative feedback explaining the evaluation
  - `deduction_explanations`: Explanations for any deductions made
- `human_scores_agg`: Aggregated human scores
  - `mean`: Average score from human annotators
  - `count`: Number of human annotators for this example
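As a rough illustration of working with these fields, the sketch below selects the rows that carry human annotations and counts them per dimension. It operates on plain Python dictionaries shaped like the fields listed above. The helper names, the placeholder rows, and the assumption that `human_annotations` is empty or missing when no human evaluation exists are all hypothetical and not confirmed by the dataset card. Real rows could presumably be obtained with the Hugging Face `datasets` library, but the split names and exact nested types are not documented here.

```python
from collections import Counter
from typing import Any, Iterable


def rows_with_human_annotations(rows: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only rows whose `human_annotations` list is present and non-empty."""
    return [row for row in rows if row.get("human_annotations")]


def count_by_dimension(rows: Iterable[dict[str, Any]]) -> Counter:
    """Count rows per value of the `dim` field."""
    return Counter(row["dim"] for row in rows)


# Hypothetical placeholder rows (values invented purely for illustration).
example_rows = [
    {
        "id": "row-1",
        "dim": "encourage_learning",
        "human_annotations": [
            {"evaluator": "human", "annotator_id": "a1", "score": 8}
        ],
        "human_scores_agg": {"mean": 8.0, "count": 1},
    },
    {
        "id": "row-2",
        "dim": "encourage_learning",
        "human_annotations": [],  # assumption: empty when no human evaluation
    },
    {
        "id": "row-3",
        "dim": "correct_misinformation",
        "human_annotations": [
            {"evaluator": "human", "annotator_id": "a2", "score": 5}
        ],
        "human_scores_agg": {"mean": 5.0, "count": 1},
    },
]

annotated = rows_with_human_annotations(example_rows)
per_dimension = count_by_dimension(annotated)
```

On the full dataset, a count like this would be expected to give 150 human-annotated rows for each of the 6 dimensions, if the description above holds.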
## Example Structure
Every row contains evaluations from all 4 model evaluators, and a subset of rows (900 out of 60,000) also contains evaluations from human crowdworkers. For example, the `model_annotations` column of a given row might show:
- GPT-4.1 giving a score of 10 with no issues
- Claude giving a score of 8 with issue "D"
- Gemini giving a score of 0 with issue "A"
- O3 giving a score of 10 with no issues
For rows with human evaluations, the separate `human_annotations` column shows the numeric score given by each human annotator, the specific deductions they selected, and comments motivating their choice.
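The example row above shows that the model evaluators can disagree sharply. A minimal sketch of how that disagreement might be measured follows. It encodes the four illustrative evaluations as a list of dictionaries and assumes that `evaluator_deductions` is a string of letter codes, empty when no issue was flagged. That representation and the helper names `score_spread` and `flagged_issues` are assumptions made for illustration, not the dataset's documented schema.

```python
from typing import Any


def score_spread(annotations: list[dict[str, Any]]) -> int:
    """Difference between the highest and lowest score among the evaluators."""
    scores = [a["score"] for a in annotations]
    return max(scores) - min(scores)


def flagged_issues(annotations: list[dict[str, Any]]) -> dict[str, str]:
    """Map each evaluator that reported deductions to its letter codes."""
    return {
        a["evaluator"]: a["evaluator_deductions"]
        for a in annotations
        if a["evaluator_deductions"]
    }


# The illustrative row from the text, using the evaluator names from Data Fields.
# Assumption: an empty string means "no issues".
model_annotations = [
    {"evaluator": "4.1", "score": 10, "evaluator_deductions": ""},
    {"evaluator": "claude", "score": 8, "evaluator_deductions": "D"},
    {"evaluator": "gemini", "score": 0, "evaluator_deductions": "A"},
    {"evaluator": "o3", "score": 10, "evaluator_deductions": ""},
]

spread = score_spread(model_annotations)  # 10 - 0 = 10
flags = flagged_issues(model_annotations)  # {"claude": "D", "gemini": "A"}
```

A large spread such as this one marks rows where comparing against `human_scores_agg` may be most informative.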
## Citation
[To be added]
Provider:
anon34957



