odl-raiser/GGBench

Name: odl-raiser/GGBench
Creator: odl-raiser
Published: 2025-11-17 11:44:59
License: No description available

Hugging Face · Updated 2025-11-17 · Indexed 2026-01-03

Download link:

https://hf-mirror.com/datasets/odl-raiser/GGBench

Resource description:

---
license: mit
language:
- en
tags:
- generation
- think-with-images
- unified-multimodal-model
pretty_name: GGBench
task_categories:
- visual-question-answering
- text-to-image
size_categories:
- 1K<n<10K
papers:
- https://arxiv.org/abs/2511.11134
- https://huggingface.co/papers/2511.11134
arxiv: 2511.11134
---

## Associated Paper

This dataset is associated with the following paper:

**GeoCraft / GGBench: A Comprehensive Benchmark for Geometry Construction**

- arXiv: https://arxiv.org/abs/2511.11134

# GGBench Evaluation Script Documentation

## Directory Structure Overview

```
dataset/
├── evaluate.py              # Unified evaluation entry script
├── eval_prompts.py          # Judge model prompt templates
├── GGBench_dataset.json     # Official dataset (evaluation benchmark)
├── Q&A_image/               # Problem images and final result images
├── long_image/              # Long process images
├── eval_output/             # Default output directory for evaluation results
└── requirements.txt         # Python dependencies
```

## 1. Data Download and Directory Setup

1. **Get the Dataset**

   ```bash
   git lfs install
   git clone https://huggingface.co/datasets/opendatalab-raiser/GGBench
   ```

2. **Extract Dataset Archive**

   ```bash
   tar -xzvf dataset.tar.gz
   ```

   After extraction, you will get the `dataset/` directory and original resource files.

## 2. Environment Setup

1. **Python Version**: Python 3.9 or higher is recommended.
2. **Install Dependencies**:

   ```bash
   pip install -r requirements.txt
   ```

## 3. `evaluate.py` Configuration

All configurable parameters are defined at the top of the script. Modify them according to your model and data paths before running:

- `DATASET_PATH`: Path to GGBench dataset JSON, defaults to `./GGBench_dataset.json` in the current directory.
- `MODEL_OUTPUT_PATH`: Path to the model output JSON (list structure) to be evaluated, default example is `test.json`.
- `DATASET_ROOT` / `PRED_ROOT`: Root directories for original dataset resources and model-generated resources, used for resolving relative paths.
- `OUTPUT_JSON` / `OUTPUT_JSONL`: Output locations for evaluation results. The script will automatically overwrite old entries and preserve other results.
- `JUDGE_MODEL`, `JUDGE_URL`, `JUDGE_API_KEY`: Judge model name, base URL, and API key. The script uses OpenAI-compatible interface.
- `MAX_WORKERS`: Number of concurrent threads.
- `ENABLE_*` switches: Control whether to enable each evaluation module (final image judge, text chain judge, mid-process judge, LPIPS, PSNR, SSIM).
- `LOG_FILE`: Log output location, defaults to `eval_output/evaluate.log`.

## 4. Input Data Requirements

### 4.1 GGBench Dataset (Ground Truth)

- Located in `GGBench_dataset.json`, each sample contains fields such as `id`, `question`, `question_image`, `text_answer`, `res_image`, etc.
- The script will automatically complete dataset information into model output items based on `id`.

### 4.2 Model Output File

- The JSON pointed to by `MODEL_OUTPUT_PATH` needs to be a list, where each element contains:
  - `id`: Matches entries in the dataset.
  - Model-generated text fields (e.g., `output`) and image paths (e.g., `output_image_path` or `image_4`, etc.).
  - If there are intermediate process long images, provide `long_image_path`.
- If paths are relative, they will be resolved based on `PRED_ROOT`.

## 5. Running the Script

Execute in the `dataset/` directory:

```bash
python evaluate.py
```

Script execution flow:

1. Initialize logging and dependencies.
2. Read GGBench dataset and model output, merge and complete information by `id`.
3. Execute each evaluation module according to switches:
   - **Final Image Judge** (`VLM_eval_image_result`): Calls judge model to compare reference image with model final image.
   - **Text Judge** (`eval_text_result`): Compares problem, reference answer with model text output.
   - **LPIPS / PSNR / SSIM**: Deep perception and pixel-level metrics, automatically handles fallback logic (missing images or exceptions return 0).
   - **Mid-Process Judge** (`Step Accuracy`, `Process Consistency`, `Problem-Solution Accuracy`): Evaluates multi-step generation process.
4. Write results to `OUTPUT_JSON` (list) and `OUTPUT_JSONL` (optional). If the target JSON already exists, the script will overwrite old records by `id`, and unevaluated entries will be preserved.

After execution, you can check in the `eval_output/` directory:

- `result.json`: Evaluation results summary.
- `result.jsonl` (optional): Line-by-line JSON for streaming processing.
- `score.json`: Aggregated total scores.
- `evaluate.log`: Complete log including errors and warnings.

## 6. Common Scenarios and Recommendations

- **Run Only Part of Modules**: Set the corresponding `ENABLE_*` constant to `False`, and the script will skip that evaluation.
- **Batch Evaluation for Multiple Models**: Write an outer script to loop through modifying `MODEL_OUTPUT_PATH` and output paths, then call `python evaluate.py`.
- **Judge Model Change**: Update `JUDGE_MODEL`, `JUDGE_URL`, `JUDGE_API_KEY`, and ensure the new model is compatible with `OpenAI`-style interface.
- **Path Resolution Errors**: Check if `DATASET_ROOT` and `PRED_ROOT` are correct, ensure reference images and predicted images exist. When images are missing, the script will prompt in the log and set related metrics to 0.
- **Incremental Evaluation**: Since the write logic overwrites by `id`, you can repeatedly run the script to update partial entries without manually cleaning old results.

## 7. Further Customization

- Prompts are located in `eval_prompts.py`. If you need to adjust judge criteria or language, you can directly modify these templates.
- To add new metrics, refer to existing function structures, add new modules in `evaluate.py` and enable them in the main flow.
- If you want to save the complete output of the judge model, you can add `_raw_*` fields in each evaluation function or extend logging.
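
The overwrite-by-`id` behaviour described for `OUTPUT_JSON` (old records with the same `id` are replaced, other records are kept) can be pictured with the small sketch below. It is not the code from `evaluate.py`: the function names `merge_results_by_id` and `write_results` and the example score field `psnr` are hypothetical, chosen only to illustrate the idea.

```python
import json
import os


def merge_results_by_id(existing, new):
    """Return a list in which records from `new` replace same-`id` records in `existing`.

    Hypothetical helper; records are assumed to be dicts with a hashable `id`.
    """
    merged = {item["id"]: item for item in existing}  # dicts keep insertion order
    for item in new:
        merged[item["id"]] = item  # overwrite an old record or append a new one
    return list(merged.values())


def write_results(path, new_records):
    """Merge `new_records` into the JSON list stored at `path` (created if missing)."""
    existing = []
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            existing = json.load(f)
    merged = merge_results_by_id(existing, new_records)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return merged
```

Keying the merge on `id` is what makes repeated runs safe in this sketch: re-evaluating a subset of samples only touches those records, and everything else in the file is written back unchanged.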
---

If you encounter problems during use, first check `eval_output/evaluate.log` for error details, or debug with source code. Feel free to extend or integrate into larger evaluation workflows according to project needs. Happy evaluating!
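
As a companion to the model output requirements in section 4.2, the following sketch checks a model output list before it is handed to the evaluator. The field names `id`, `output_image_path`, and `long_image_path` come from the documentation above; the helper `check_model_output` and the way relative paths are joined to `pred_root` are assumptions for illustration, not the script's actual logic.

```python
import os


def check_model_output(items, pred_root="."):
    """Return a list of human-readable problems found in a model output list.

    Hypothetical pre-flight check; an empty list means nothing was flagged.
    """
    if not isinstance(items, list):
        return ["top-level JSON value must be a list"]
    problems = []
    seen = set()
    for index, item in enumerate(items):
        if not isinstance(item, dict) or "id" not in item:
            problems.append(f"item {index}: missing `id`")
            continue
        if item["id"] in seen:
            problems.append(f"item {index}: duplicate id {item['id']!r}")
        seen.add(item["id"])
        for key in ("output_image_path", "long_image_path"):
            path = item.get(key)
            if not path:
                continue  # image fields are treated as optional here
            # Assumption: relative paths are resolved against pred_root.
            full = path if os.path.isabs(path) else os.path.join(pred_root, path)
            if not os.path.exists(full):
                problems.append(f"item {index}: {key} not found: {full}")
    return problems
```

A typical use would be to load the file at `MODEL_OUTPUT_PATH` with `json.load`, pass the result to `check_model_output` together with the value of `PRED_ROOT`, and fix any reported entries before running `python evaluate.py`.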

Provider:

odl-raiser
