8Fai/Healthybench-German
Source: Hugging Face. Updated 2026-04-27; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/8Fai/Healthybench-German
Resource description:
---
license: mit
task_categories:
- text-generation
- question-answering
language:
- de
tags:
- health
- safety
- benchmark
- evaluation
- german
- triage
size_categories:
- n<1K
dataset_info:
  splits:
  - name: test
    num_examples: 750
    num_bytes: 46182
  download_size: 46182
  dataset_size: 46182
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
---
# Healthybench-German
Healthybench-German is a German-language health-and-safety evaluation dataset for benchmarking models on cautious, user-facing guidance in everyday wellbeing, triage, first-aid, medication-safety, and crisis-escalation scenarios.
The dataset contains 750 evaluation examples in a single `test` split. Each example includes one German user prompt, rubric items for scoring, a category, a difficulty label, a concise reference answer, and a longer reference solution. The benchmark is designed for evaluating safe communication and escalation behavior rather than diagnostic precision.
## Overview
Healthybench-German focuses on practical safety-sensitive situations where a model should respond carefully, avoid false certainty, identify red flags, and point to professional help when appropriate. It is intended as a compact benchmark for German-language evaluation workflows.
## Scope
Healthybench-German covers:
* emergency escalation
* first aid
* medication safety
* infection-related advice
* heat and dehydration
* allergy response
* gastrointestinal complaints
* mental-health crisis support
The benchmark is not a substitute for medical training and should not be interpreted as a source of clinical truth. It is an evaluation set for model behavior.
## Benchmark Design
Each record contains:
* `conversation`: one user message in German
* `rubric_items`: criteria for scoring safe and useful responses
* `use_case`: currently `health_guidance`
* `type`: currently `good_faith`
* `difficulty`: evaluation difficulty label
* `category`: health-and-safety subdomain
* `reference_answer`: concise target answer
* `reference_solution`: more complete reference response
* `canary_string`: canary for contamination filtering
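As a quick sanity check before scoring, a loader can confirm that every record carries the fields listed above. The following is a minimal sketch: the field names come from this card, but the value types are not documented here, so the check covers presence only. `missing_fields` and the `example` record are hypothetical helpers introduced for illustration.

```python
# Field names are taken from the record description in this card.
# Value types are not documented there, so this sketch checks presence only.
EXPECTED_FIELDS = (
    "conversation",
    "rubric_items",
    "use_case",
    "type",
    "difficulty",
    "category",
    "reference_answer",
    "reference_solution",
    "canary_string",
)


def missing_fields(record: dict) -> list[str]:
    """Return the expected field names that are absent from one record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


# Hypothetical record with placeholder values, for illustration only.
example = {name: None for name in EXPECTED_FIELDS}
del example["canary_string"]
print(missing_fields(example))  # ['canary_string']
```

An empty list means the record has every documented field; it says nothing about whether the values are well formed.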
## Intended Use
Healthybench-German is intended for:
* benchmarking German health-safety behavior
* testing escalation quality and guardrailed responses
* comparing prompt or policy variants
* regression testing in safety-sensitive model workflows
* evaluating whether models avoid overconfident medical claims
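For variant comparison and regression testing, one possible approach is to average per-example scores within each `category` and flag categories where a candidate variant falls below the baseline. The sketch below assumes each example has already been reduced to a single numeric score; this card does not define how rubric items are turned into scores, so that step is left out. The function names and the category labels in the usage lines are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_category(rows):
    """Average per-example scores within each category.

    `rows` is an iterable of (category, score) pairs. How a score is derived
    from the rubric items is an assumption left to the caller.
    """
    buckets = defaultdict(list)
    for category, score in rows:
        buckets[category].append(score)
    return {category: mean(scores) for category, scores in buckets.items()}


def regressions(baseline, candidate, tolerance=0.0):
    """Return categories whose candidate mean is below the baseline mean by more than `tolerance`."""
    base = mean_score_by_category(baseline)
    cand = mean_score_by_category(candidate)
    return sorted(c for c in base if c in cand and cand[c] < base[c] - tolerance)


# Hypothetical category labels and scores, for illustration only.
baseline_rows = [("first_aid", 1.0), ("first_aid", 0.5), ("allergy", 0.5)]
candidate_rows = [("first_aid", 0.5), ("first_aid", 0.5), ("allergy", 1.0)]
print(regressions(baseline_rows, candidate_rows))  # ['first_aid']
```

Reporting per category rather than as one global mean keeps a drop in one subdomain from being hidden by gains in another.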
## Limitations
This dataset is synthetic. It is useful for measuring consistency, clarity, escalation quality, and obvious safety failures, but it does not measure full clinical competence or real-world care quality.
The benchmark intentionally prioritizes safe, bounded guidance. It should not be used as a replacement for expert review in medical or crisis-response systems.
## Files
* `data/test-00000-of-00001.parquet`: primary parquet release
* `healthybench_german_eval.jsonl`: companion JSONL export
* `metadata.json`: dataset metadata and validation summary
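The JSONL export can be read without any third-party dependency. This is a minimal sketch that assumes the file follows the usual JSON Lines convention of one JSON object per line; `parse_jsonl` is a hypothetical helper name.

```python
import json
from typing import Iterable


def parse_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse JSON Lines input (one JSON object per line), skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records
```

With a local copy of the file, usage would look like `with open("healthybench_german_eval.jsonl", encoding="utf-8") as handle: records = parse_jsonl(handle)`. Passing `encoding="utf-8"` explicitly avoids platform-dependent decoding of German umlauts.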
## Data Integrity
Local validation at generation time checks that the release contains:
* 750 total records
* 750 unique prompts
* 750 unique IDs
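The same three counts can be recomputed downstream. The sketch below is not the authors' validation code. It assumes records are available as a list of dicts and that the ID lives in a field named `id`, which this card does not confirm, so pass a different `id_key` if the actual name differs.

```python
import json


def integrity_report(records, prompt_key="conversation", id_key="id"):
    """Count records, distinct prompts, and distinct IDs.

    `id_key="id"` is an assumed field name; the card does not name the ID field.
    """
    # Serialising the prompt gives a hashable, order-stable key even when the
    # prompt is stored as a list or dict rather than a plain string.
    prompts = {
        json.dumps(r[prompt_key], sort_keys=True, ensure_ascii=False) for r in records
    }
    ids = {r[id_key] for r in records}
    return {
        "records": len(records),
        "unique_prompts": len(prompts),
        "unique_ids": len(ids),
    }


# Toy records for illustration only: two IDs, but the prompt is duplicated.
toy = [
    {"id": "a", "conversation": "Hallo"},
    {"id": "b", "conversation": "Hallo"},
]
print(integrity_report(toy))
# {'records': 2, 'unique_prompts': 1, 'unique_ids': 2}
```

For the released split, all three values should equal 750 if the guarantees above hold.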
## Example Usage
```python
from datasets import load_dataset

# The path is relative: run this from the directory that holds a local copy of the repository.
dataset = load_dataset("parquet", data_files={"test": "Healthybench-German/data/test-*.parquet"})
```
## Notes
The dataset includes a canary string to simplify downstream filtering and contamination checks. Because the examples concern health and safety, benchmark scores should be interpreted alongside expert judgment rather than as a complete measure of safety performance.
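Contamination filtering with a canary usually amounts to a substring check over candidate training documents. The sketch below illustrates that idea with a placeholder canary; the real value is stored in each record's `canary_string` field and is not reproduced here. `drop_canary_documents` is a hypothetical helper name.

```python
def drop_canary_documents(documents, canary):
    """Keep only the documents that do not contain the canary string."""
    if not canary:
        # An empty canary would match every document and drop the whole corpus.
        raise ValueError("canary must be a non-empty string")
    return [doc for doc in documents if canary not in doc]


# Placeholder value for illustration only; read the real one from `canary_string`.
PLACEHOLDER_CANARY = "EXAMPLE-CANARY-0000"
corpus = [
    "Ein normaler Text.",
    "Leaked benchmark row EXAMPLE-CANARY-0000",
    "Noch ein Text.",
]
print(drop_canary_documents(corpus, PLACEHOLDER_CANARY))
# ['Ein normaler Text.', 'Noch ein Text.']
```

An exact substring match only catches verbatim copies that kept the canary, so a clean result does not rule out paraphrased or canary-stripped leakage.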
Provider:
8Fai



