bermaneh/pde-llm-eval-results-v2
Source: Hugging Face. Updated 2026-04-27. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/bermaneh/pde-llm-eval-results-v2
Resource description:
---
license: mit
tags:
- pde-llm-eval
- free-gen
- v3
---
# pde-llm-eval-results-v2
Free-gen PDE eval results for 10 models on the v3 dataset (128 rows, 8 conditions). The 9 existing models have 144 rows each (v2+v3); DeepSeek-R1-Distill-Qwen-32B has 128 rows (v3 only), giving 9 × 144 + 128 = 1424 rows in total.
## Dataset Info
- **Rows**: 1424
- **Columns**: 21
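
The row total can be cross-checked against the per-model counts given in the description. The sketch below is illustrative only: `rows_per_model` and `check_counts` are hypothetical helper names, and it assumes each of the 9 existing models contributes exactly 144 rows, which is consistent with the stated total of 1424.

```python
from collections import Counter

# Counts taken from the card: 9 models with 144 rows, 1 model with 128 rows.
EXPECTED_TOTAL = 9 * 144 + 128  # equals the "Rows" figure, 1424


def rows_per_model(rows):
    """Count rows for each value of the `model` column."""
    return Counter(row["model"] for row in rows)


def check_counts(rows):
    """Return True if the per-model counts match the card's description."""
    counts = rows_per_model(rows)
    total_ok = sum(counts.values()) == EXPECTED_TOTAL
    split_ok = sorted(counts.values()) == [128] + [144] * 9
    return total_ok and split_ok
```

Any iterable of row dictionaries works as input, including a loaded `datasets.Dataset`, which yields one dictionary per row when iterated.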
## Columns
| Column | Type | Description |
|--------|------|-------------|
| title | Value('string') | *No description provided* |
| pde_class | Value('string') | *No description provided* |
| mod_type | Value('string') | *No description provided* |
| gt_pde | Value('string') | *No description provided* |
| gt_method | Value('string') | *No description provided* |
| gt_behavior | Value('string') | *No description provided* |
| gt_valid | Value('bool') | *No description provided* |
| model_response | Value('string') | Full model output (never truncated) |
| parsed_pde | Value('string') | Extracted PDE type from response |
| parsed_method | Value('string') | Extracted numerical method(s) |
| parsed_behavior | Value('string') | Extracted physical process(es) |
| parsed_valid | Value('string') | Extracted validity answer |
| finish_reason | Value('string') | vLLM stop reason (stop/length) |
| model | Value('string') | *No description provided* |
| pde_match | Value('int64') | Binary keyword match for PDE type |
| pde_embed_sim | Value('float64') | *No description provided* |
| method_any_match | Value('int64') | 1 if any GT method token found in response |
| method_recall | Value('float64') | Fraction of GT method tokens found |
| behavior_any_match | Value('int64') | 1 if any GT behavior token found |
| behavior_recall | Value('float64') | Fraction of GT behavior tokens found |
| valid_match | Value('int64') | Binary match for validity field |
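
The `*_any_match` and `*_recall` columns are described as token-level comparisons against the ground truth. A minimal sketch of that kind of scoring follows. It assumes comma-separated ground-truth tokens and case-insensitive substring matching; both are assumptions, since the card does not state the tokenization or matching rule used by `run_eval.py`, and `token_scores` is a hypothetical name.

```python
def token_scores(gt: str, response: str, sep: str = ","):
    """Return (any_match, recall) for ground-truth tokens found in a response.

    Assumption: `gt` is a `sep`-separated list of tokens, and a token counts
    as found if it appears as a case-insensitive substring of `response`.
    """
    tokens = [t.strip().lower() for t in gt.split(sep) if t.strip()]
    if not tokens:
        # No ground-truth tokens to look for: report no match.
        return 0, 0.0
    text = response.lower()
    hits = sum(1 for t in tokens if t in text)
    return int(hits > 0), hits / len(tokens)
```

Under these assumptions, a response that mentions one of two ground-truth methods would score an any-match of 1 and a recall of 0.5.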
## Generation Parameters
```json
{
"script_name": "run_eval.py",
"model": "multi (10 models)",
"description": "Free-gen PDE eval: 10 models, v3 dataset (128 rows, 8 conditions). 9 existing models have 144 rows (v2+v3), DeepSeek-R1-Distill-Qwen-32B has 128 rows (v3 only).",
"experiment_name": "pde-llm-eval",
"job_id": "torch:7248133",
"cluster": "torch",
"artifact_status": "final",
"canary": false,
"hyperparameters": {},
"input_datasets": []
}
```
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("bermaneh/pde-llm-eval-results-v2", split="train")
print(f"Loaded {len(dataset)} rows")
```
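
Because the rows from all 10 models are stored in one split, a per-model summary is a natural next step. The following is a sketch rather than part of the original evaluation code: `mean_by_model` is a hypothetical helper, and the choice of metrics is illustrative. It uses only column names listed in the table above.

```python
from collections import defaultdict
from statistics import mean

# Numeric score columns from the card's column table.
METRICS = ["pde_match", "method_recall", "behavior_recall", "valid_match"]


def mean_by_model(rows, metrics=METRICS):
    """Average each metric per model and report the share of truncated outputs."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["model"]].append(row)

    summary = {}
    for model, model_rows in grouped.items():
        stats = {m: mean([r[m] for r in model_rows]) for m in metrics}
        # finish_reason is "length" when generation hit the token limit.
        stats["truncated"] = mean(
            [int(r["finish_reason"] == "length") for r in model_rows]
        )
        summary[model] = stats
    return summary
```

Calling `mean_by_model(dataset)` with the `dataset` object loaded above should return one dictionary of averages per model, since iterating a `datasets.Dataset` yields row dictionaries.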
---
Provider:
bermaneh



