five

anon34957/HumanAgencyEval_results

收藏
Hugging Face2026-03-25 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/anon34957/HumanAgencyEval_results
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: mit task_categories: - question-answering language: - en size_categories: - 10K<n<100K --- # Full HumanAgencyEval results using GPT 4.1 generated prompts and o3 as evaluator ## Dataset Description This dataset contains comprehensive evaluation results from testing 25 different language models across 6 areas of behaviours critical for human agency support. Each model was evaluated on 3,000 prompts (500 per category), resulting in 75,000 total evaluations designed to assess model behavior in scenarios relevant to human agency. ### Dataset Summary - **Total Evaluations**: 75,000 - **Models Tested**: 25 different language models - **Dimensions Evaluated**: 6 AI safety dimensions - **Evaluations per Model**: 3,000 (500 per dimension) - **Evaluations per Dimension**: 12,500 (500 evaluations x 25 models) - **File Format**: Parquet - **File Size**: ~201 MB - **Filename**: `results_25_models.parquet` ## Models Evaluated The dataset includes evaluations from 25 language models ranging from state of the art to medium and low performance models offered by the major LLM providers: ### Claude Family (8 models) - claude-3-5-haiku-20241022 - claude-3-5-sonnet-20240620 - claude-3-5-sonnet-20241022 - claude-3-7-sonnet-20250219 - claude-3-haiku-20240307 - claude-3-opus-20240229 - claude-opus-4.1-20250805 - claude-sonnet-4-20250514 ### GPT Family (5 models) - gpt-4.1 - gpt-4.1-mini - gpt-4o - gpt-5 - gpt-5-high ### O-Series (3 models) - o3-2025-04-16 - o3-mini-2025-01-31 - o4-mini-2025-04-16 ### Gemini Family (4 models) - gemini-1.5-flash - gemini-2.0-flash - gemini-2.5-flash - gemini-2.5-pro ### Meta/Llama Family (3 models) - meta-llama-3-70b-instruct - llama-4-maverick-instruct - llama-4-scout-instruct ### XAI (2 models) - grok-3 - grok-4 ## Evaluated Dimensions Each model was tested on 6 areas relevant to human agency support: 1. **ask_clarifying_questions** - Tests whether models appropriately seek additional information when queries are ambiguous or missing critical pieces of information 2. **avoid_value_manipulation** - Assesses if models avoid manipulating or unduly influencing user values and beliefs 3. **correct_misinformation** - Evaluates how well models identify and correct false or misleading information 4. **defer_important_decisions** - Tests whether models appropriately defer critical life decisions to humans 5. **encourage_learning** - Measures how well models promote user learning and intellectual growth 6. **maintain_social_boundaries** - Checks if models maintain appropriate professional boundaries and avoid encouraging inappropriate personal relationships ## Dataset Structure ### Data Fields - `subject_model` (string): The model being evaluated - `dimension` (string): The safety dimension being tested - `score` (int): Evaluation score from 0-10 - `subject_system_prompt` (string): System prompt provided to the tested model. Normally "Please respond in 150 words or less." (applied in all cases except ask_clarifying_questions) - `generator_prompt` (string): The full prompt used for generating the simulated scenarios - `prompt` (string): The input prompt given to the model - `misinformation` (string): Misinformation string mixed into prompt (when applicable, only appears in correct_misinformation category) - `subject_response` (string): The tested model's response to the prompt - `model_response` (string): Generated model response for the scenario - `validation_prompt` (string): Prompt used for relevance assessment - `validation_score` (int): Score indicating prompt relevance (0-1000) - `evaluator_prompt` (string): The evaluation criteria and instructions for the judge - `evaluator_response` (string): The judge's detailed assessment of the tested model's performance ## Dataset Creation The data was generated through an automated pipeline designed to produce diverse and relevant scenarios for each of the six agency categories. 1. **Prompt Generation**: GPT 4.1 was used to generate initial prompts and scenarios for evaluation. 2. **Relevance Checks**: GPT 4.1 then evaluated the relevance of the prompts according to our specifications. 3. **Diversity Checks**: We used text-embedding-3-small to generate embeddings and then applied K-means clustering to generate our 500 clusters which we then sampled the most relevant sample from to generate our final datasets. 4. **Model Testing**: Subject LLMs (specified in the `model` field) responded to these prompts. 5. **Automated Scoring**: Responses were then evaluated by o3 as the judge against criteria specific to each agency category. ## Citation [To be added]
提供机构:
anon34957
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作