fuyyckwhy/HS-Bench-results
Source: Hugging Face. Updated 2026-02-03. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/fuyyckwhy/HS-Bench-results
Resource description:
---
license: mit
task_categories:
- other
size_categories:
- 1M<n<10M
language:
- en
pretty_name: HS-Bench Results
tags:
- human-subject simulation
- benchmark
- evaluation
- PAS
- ECS
---
# HS-Bench Results
Benchmark results for **HumanStudy-Bench (HS-Bench)**: evaluation outputs from running AI agents through reconstructed human-subject experiments.
## Dataset description
This dataset contains:
- **12 studies** (`study_001`–`study_012`): replicated experiments from published human-subject research (cognition, strategic interaction, social psychology).
- **Multiple model × agent-design runs per study**: e.g. different LLMs (Mistral, GPT, Claude, Gemini, etc.) and presets (`v1-empty`, `v2-human`, `v3-human-plus-demo`, `v4-background`).
- **Per-run artifacts**:
  - `full_benchmark.json` – full trial-level and aggregate results
  - `evaluation_results.json` – PAS/ECS and related metrics
  - `raw_responses.json` / `raw_responses.jsonl` – model outputs
  - `detailed_stats.csv` – detailed statistics
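
Because raw model outputs may be stored as either `raw_responses.json` or `raw_responses.jsonl`, a reader for one run directory has to handle both. The following is a minimal sketch, not part of HS-Bench itself: the function name `load_raw_responses` is hypothetical, and since the record schema is not documented in this card, records are returned exactly as parsed.

```python
import json
from pathlib import Path


def load_raw_responses(run_dir):
    """Return the raw model outputs found in one run directory.

    Hypothetical helper: tries raw_responses.json first, then
    raw_responses.jsonl. No assumption is made about the record schema.
    """
    run_dir = Path(run_dir)
    json_path = run_dir / "raw_responses.json"
    jsonl_path = run_dir / "raw_responses.jsonl"

    if json_path.exists():
        # A single JSON document: return whatever structure it holds.
        with json_path.open(encoding="utf-8") as f:
            return json.load(f)

    if jsonl_path.exists():
        # JSON Lines: one JSON value per non-empty line.
        records = []
        with jsonl_path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    records.append(json.loads(line))
        return records

    raise FileNotFoundError(f"no raw_responses file in {run_dir}")
```

Calling `load_raw_responses("study_001/<model>_<preset>")` on a downloaded copy would then return the parsed contents of whichever file is present.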
## Metrics
- **PAS (Probability Alignment Score)**: Whether agents reach the same scientific conclusions as humans at the phenomenon level.
- **ECS (Effect Consistency Score)**: How closely agents reproduce the magnitude and pattern of human behavioral effects.
## Structure
```
study_001/
  <model>_<preset>/
    full_benchmark.json
    evaluation_results.json
    raw_responses.json
    detailed_stats.csv
  ...
study_002/
  ...
...
```
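The layout above can be walked to build an inventory of runs in a local copy of the dataset. This is a sketch under assumptions: `list_runs` and `EXPECTED_FILES` are hypothetical names, and the run directory name is kept whole rather than split into model and preset, since the card does not specify how the two parts are delimited beyond `<model>_<preset>`.

```python
from pathlib import Path

# Artifact names taken from the dataset description; a given run may
# contain only some of them.
EXPECTED_FILES = (
    "full_benchmark.json",
    "evaluation_results.json",
    "raw_responses.json",
    "raw_responses.jsonl",
    "detailed_stats.csv",
)


def list_runs(root):
    """List every run directory under root with the artifacts it contains."""
    runs = []
    for study_dir in sorted(Path(root).glob("study_*")):
        if not study_dir.is_dir():
            continue
        for run_dir in sorted(p for p in study_dir.iterdir() if p.is_dir()):
            present = [name for name in EXPECTED_FILES if (run_dir / name).is_file()]
            runs.append(
                {"study": study_dir.name, "run": run_dir.name, "files": present}
            )
    return runs
```

Each entry records the study, the run directory name, and which of the expected files exist, which makes incomplete runs easy to spot before analysis.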
## Citation
If you use this dataset or HumanStudy-Bench, please cite:
```bibtex
@misc{liu2026humanstudybenchaiagentdesign,
  title={HumanStudy-Bench: Towards AI Agent Design for Participant Simulation},
  author={Xuan Liu and Haoyang Shang and Zizhang Liu and Xinyan Liu and Yunze Xiao and Yiwen Tu and Haojian Jin},
  year={2026},
  eprint={2602.00685},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.00685},
}
```
**Paper**: [arXiv:2602.00685](https://arxiv.org/abs/2602.00685)
## Source
Generated by [HumanStudy-Bench](https://github.com/XuanL17/HumanStudy-Bench/) (HS-Bench). Use this dataset to compare agent designs, reproduce results, or run further analysis.
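
For comparing agent designs, it may be enough to fetch only the `evaluation_results.json` files instead of the full dataset. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function with `allow_patterns`; the helper names `evaluation_patterns` and `download_evaluation_results` are hypothetical, and the glob patterns assume the directory layout described in this card.

```python
REPO_ID = "fuyyckwhy/HS-Bench-results"


def evaluation_patterns(studies=None):
    """Build glob patterns that match only evaluation_results.json files.

    studies: optional list of study directory names such as "study_001".
    With no argument, the pattern covers every study directory.
    """
    if studies is None:
        return ["study_*/*/evaluation_results.json"]
    return [f"{study}/*/evaluation_results.json" for study in studies]


def download_evaluation_results(local_dir, studies=None):
    """Download the matching files and return the local folder path."""
    # Imported lazily so the pattern helper works without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=evaluation_patterns(studies),
    )
```

For example, `download_evaluation_results("hs_bench", ["study_001"])` would request only the metric files for the first study.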
Provided by:
fuyyckwhy



