anon34957/HumanAgencyEval_results
Source: Hugging Face. Updated 2026-03-25. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/anon34957/HumanAgencyEval_results
Resource description:
---
license: mit
task_categories:
- question-answering
language:
- en
size_categories:
- 10K<n<100K
---
# Full HumanAgencyEval results using GPT 4.1 generated prompts and o3 as evaluator
## Dataset Description
This dataset contains comprehensive evaluation results from testing 25 different language models across 6 areas of behaviours critical for human agency support. Each model was evaluated on 3,000 prompts (500 per category), resulting in 75,000 total evaluations designed to assess model behavior in scenarios relevant to human agency.
### Dataset Summary
- **Total Evaluations**: 75,000
- **Models Tested**: 25 different language models
- **Dimensions Evaluated**: 6 AI safety dimensions
- **Evaluations per Model**: 3,000 (500 per dimension)
- **Evaluations per Dimension**: 12,500 (500 evaluations x 25 models)
- **File Format**: Parquet
- **File Size**: ~201 MB
- **Filename**: `results_25_models.parquet`
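The counts above follow from simple arithmetic and can be checked against the file itself. The following is a minimal sketch, assuming the Parquet file has been downloaded locally and that pandas with a Parquet engine (such as pyarrow) is installed. `load_rows` and `summarize` are hypothetical helper names, not part of the dataset.

```python
from collections import Counter


def load_rows(path="results_25_models.parquet"):
    """Read only the two columns needed for counting (needs pandas + a Parquet engine)."""
    import pandas as pd  # imported lazily so summarize() works without pandas

    frame = pd.read_parquet(path, columns=["subject_model", "dimension"])
    return frame.to_dict("records")


def summarize(rows):
    """Count evaluations overall, per model, and per dimension."""
    per_model = Counter(row["subject_model"] for row in rows)
    per_dimension = Counter(row["dimension"] for row in rows)
    return {
        "total": len(rows),
        "models": len(per_model),
        "dimensions": len(per_dimension),
        "per_model": per_model,
        "per_dimension": per_dimension,
    }


# Total stated in the summary: 25 models x 6 dimensions x 500 prompts.
EXPECTED_TOTAL = 25 * 6 * 500
```

If the file matches the card, `summarize(load_rows())["total"]` should equal `EXPECTED_TOTAL`, with 3,000 rows per model and 12,500 per dimension.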
## Models Evaluated
The dataset includes evaluations of 25 language models from the major LLM providers, ranging from state-of-the-art models to mid- and lower-performance ones:
### Claude Family (8 models)
- claude-3-5-haiku-20241022
- claude-3-5-sonnet-20240620
- claude-3-5-sonnet-20241022
- claude-3-7-sonnet-20250219
- claude-3-haiku-20240307
- claude-3-opus-20240229
- claude-opus-4.1-20250805
- claude-sonnet-4-20250514
### GPT Family (5 models)
- gpt-4.1
- gpt-4.1-mini
- gpt-4o
- gpt-5
- gpt-5-high
### O-Series (3 models)
- o3-2025-04-16
- o3-mini-2025-01-31
- o4-mini-2025-04-16
### Gemini Family (4 models)
- gemini-1.5-flash
- gemini-2.0-flash
- gemini-2.5-flash
- gemini-2.5-pro
### Meta/Llama Family (3 models)
- meta-llama-3-70b-instruct
- llama-4-maverick-instruct
- llama-4-scout-instruct
### xAI (2 models)
- grok-3
- grok-4
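For per-family comparisons, the model names listed above can be mapped back to their family by prefix. This is a sketch only: `family_of` is a hypothetical helper, the prefixes are inferred from the names in this card, and the exact strings stored in `subject_model` may differ.

```python
# Prefixes inferred from the model names listed in this card (an assumption).
# Order matters: "meta-llama-" is checked before the shorter "llama-".
FAMILY_PREFIXES = [
    ("claude-", "Claude"),
    ("gpt-", "GPT"),
    ("o3", "O-Series"),
    ("o4", "O-Series"),
    ("gemini-", "Gemini"),
    ("meta-llama-", "Meta/Llama"),
    ("llama-", "Meta/Llama"),
    ("grok-", "xAI"),
]


def family_of(model_name):
    """Return the family label for a model name, or "Unknown" if no prefix matches."""
    for prefix, family in FAMILY_PREFIXES:
        if model_name.startswith(prefix):
            return family
    return "Unknown"
```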
## Evaluated Dimensions
Each model was tested on 6 areas relevant to human agency support:
1. **ask_clarifying_questions**
   - Tests whether models appropriately seek additional information when queries are ambiguous or missing critical pieces of information
2. **avoid_value_manipulation**
   - Assesses if models avoid manipulating or unduly influencing user values and beliefs
3. **correct_misinformation**
   - Evaluates how well models identify and correct false or misleading information
4. **defer_important_decisions**
   - Tests whether models appropriately defer critical life decisions to humans
5. **encourage_learning**
   - Measures how well models promote user learning and intellectual growth
6. **maintain_social_boundaries**
   - Checks if models maintain appropriate professional boundaries and avoid encouraging inappropriate personal relationships
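A common first analysis is the mean score of each model on each of these dimensions. The sketch below assumes the rows are available as plain dictionaries with the `subject_model`, `dimension`, and `score` fields. `mean_scores` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(rows):
    """Return {(subject_model, dimension): mean score} for an iterable of row dicts."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[(row["subject_model"], row["dimension"])].append(row["score"])
    return {key: mean(scores) for key, scores in buckets.items()}
```

For example, `mean_scores(rows)[("gpt-4o", "encourage_learning")]` would give the average over that model's 500 responses in that dimension, assuming the model name is stored exactly as listed in this card.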
## Dataset Structure
### Data Fields
- `subject_model` (string): The model being evaluated
- `dimension` (string): The safety dimension being tested
- `score` (int): Evaluation score from 0-10
- `subject_system_prompt` (string): System prompt provided to the tested model. Normally "Please respond in 150 words or less." (applied in all cases except ask_clarifying_questions)
- `generator_prompt` (string): The full prompt used for generating the simulated scenarios
- `prompt` (string): The input prompt given to the model
- `misinformation` (string): Misinformation string mixed into prompt (when applicable, only appears in correct_misinformation category)
- `subject_response` (string): The tested model's response to the prompt
- `model_response` (string): Generated model response for the scenario
- `validation_prompt` (string): Prompt used for relevance assessment
- `validation_score` (int): Score indicating prompt relevance (0-1000)
- `evaluator_prompt` (string): The evaluation criteria and instructions for the judge
- `evaluator_response` (string): The judge's detailed assessment of the tested model's performance
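The field list above can be turned into a lightweight row check. This is a sketch under stated assumptions: that `misinformation` and `subject_system_prompt` may be null where they do not apply (the card does not say how absent values are stored), and that the score ranges are inclusive. `validate_row` is a hypothetical helper name.

```python
from numbers import Integral

STRING_FIELDS = [
    "subject_model", "dimension", "subject_system_prompt", "generator_prompt",
    "prompt", "misinformation", "subject_response", "model_response",
    "validation_prompt", "evaluator_prompt", "evaluator_response",
]
# Assumption: these two fields may be null when they do not apply to a row.
NULLABLE_FIELDS = {"misinformation", "subject_system_prompt"}
# Inclusive ranges taken from the field descriptions.
INT_RANGES = {"score": (0, 10), "validation_score": (0, 1000)}


def validate_row(row):
    """Return a list of problems found in one row dict (empty if none)."""
    problems = []
    for name in STRING_FIELDS:
        value = row.get(name)
        if value is None and name in NULLABLE_FIELDS:
            continue
        if not isinstance(value, str):
            problems.append(f"{name}: expected string")
    for name, (low, high) in INT_RANGES.items():
        value = row.get(name)
        # Integral also accepts NumPy integer types returned by pandas.
        if not isinstance(value, Integral) or not low <= value <= high:
            problems.append(f"{name}: expected int in [{low}, {high}]")
    return problems
```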
## Dataset Creation
The data was generated through an automated pipeline designed to produce diverse and relevant scenarios for each of the six agency categories.
1. **Prompt Generation**: GPT 4.1 was used to generate initial prompts and scenarios for evaluation.
2. **Relevance Checks**: GPT 4.1 then evaluated the relevance of the prompts according to our specifications.
3. **Diversity Checks**: We embedded the prompts with text-embedding-3-small, applied K-means clustering to form 500 clusters, and then selected the most relevant prompt from each cluster to build the final datasets.
4. **Model Testing**: Subject LLMs (specified in the `subject_model` field) responded to these prompts.
5. **Automated Scoring**: Responses were then evaluated by o3 as the judge against criteria specific to each agency category.
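The selection at the end of the diversity step can be sketched as follows. This is not the authors' code: it assumes cluster labels have already been computed (for example by K-means over the embeddings), that "most relevant" means the highest `validation_score`, and that ties keep the first candidate seen. The `cluster` key and `pick_most_relevant` are hypothetical names.

```python
def pick_most_relevant(candidates):
    """Keep one candidate per cluster: the one with the highest validation_score.

    candidates: iterable of dicts with 'prompt', 'cluster', 'validation_score'.
    Returns the winners ordered by cluster label.
    """
    best = {}
    for item in candidates:
        current = best.get(item["cluster"])
        # Strict '>' means the first candidate wins a tie.
        if current is None or item["validation_score"] > current["validation_score"]:
            best[item["cluster"]] = item
    return [best[label] for label in sorted(best)]
```

With 500 clusters per category, this yields at most 500 prompts per category, which matches the 500 prompts per dimension stated in the summary.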
## Citation
[To be added]
Provider:
anon34957



