Tuyuanpeng/Reason2Gen

Name: Tuyuanpeng/Reason2Gen
Creator: Tuyuanpeng
Published: 2026-03-24 04:41:34
License: not specified

Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Tuyuanpeng/Reason2Gen


Description:

# Reason2Gen GPT-based Full Evaluation

This document describes how to use a GPT-5.1 model (via the OpenAI API) to evaluate image-generation outputs on the Reason2Gen benchmark.

The evaluation script:

- Reads each task directory under your **Reason2Gen** benchmark.
- For every sample, loads:
  - The *question* / input prompt.
  - The *target* image (ground-truth).
  - The *generated* image from your method (e.g., Bagel / FLUX2).
- Asks GPT to judge whether the generated image correctly solves the puzzle or instruction, given the prompt and the reference target image.
- Counts **+1** for a correct image and **0** for incorrect, then reports accuracy per task and overall.

---

## 1. Directory Layout

Benchmark directory looks like this:

```text
<base_dir>/
  hanoi/
    hanoi.json
    question/
      question_0000.png
      ...
    answer/
      answer_0000.png
      ...
  clock/
    clock.json
    question/
    answer/
  ...
```

Your method’s generated images are assumed to be in:

```text
<result_root>/
  hanoi/
    <method_name>/
      edited/
        answer_0000_<suffix>.png   # generated image for that sample
  clock/
    <method_name>/
      edited/
        ...
  ...
```

Where:

- `base_dir` = path to Reason2Gen benchmark (JSON + question/answer images).
- `result_root` = root directory where you saved outputs.
- `method_name` = name of your method (e.g., `bagel`, `flux2`).
- The script matches JSON entries to files by `image_target` or `target_image`, then looks for an edited image with a fixed suffix (you can change this).

---

## 2. OpenAI API Configuration

You need:

- An **OpenAI API key** with access to the `gpt-5.1` (or similar) model.
- Python `openai` package (>= 1.0.0 style client).
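A quick preflight check for the two requirements above can save a failed run. The following is only a sketch: `preflight` is a hypothetical helper, not part of `reason2gen_gpt_eval.py`, and it assumes the package is installed under the distribution name `openai`.

```python
import os
from importlib import metadata


def preflight(env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []

    # The client reads the key from the environment, so it must be non-empty.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    # The README asks for the >= 1.0.0 style client.
    try:
        major = int(metadata.version("openai").split(".")[0])
    except metadata.PackageNotFoundError:
        problems.append("openai package is not installed")
    except ValueError:
        problems.append("could not parse the installed openai version")
    else:
        if major < 1:
            problems.append("openai package is older than 1.0.0")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("preflight:", problem)
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.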

Set your API key via environment variable:

```bash
export OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxx"
```

Or in Windows PowerShell:

```powershell
$env:OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxx"
```

The script uses the official client, for example:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from env

response = client.chat.completions.create(
    model="gpt-5.1",
    messages=[...],
)
```

You can change the model name (e.g., `gpt-4.1-mini`) in the script if desired.

---

## 3. Evaluation Script (`reason2gen_gpt_eval.py`)

Place `reason2gen_gpt_eval.py` next to this README. The script will:

1. Discover all tasks under `base_dir` (each subfolder with a `<task>.json`).
2. For each task:
   - Load the JSON list of samples.
   - For each sample:
     - Read instruction / textual description.
     - Locate `question` image (optional for GPT context).
     - Locate `answer` (target) image.
     - Locate generated image from your result folder.
     - If any image is missing, skip that sample.
     - Build a GPT prompt including:
       - Task name.
       - Natural-language instruction / description from JSON.
       - Short description of the evaluation rule (exact matching vs. conceptual).
       - Optionally, some *few-shot examples* (you can add).
     - Send **all three images** as `image_url`/`input_image` parts in the Chat Completions API:
       - question image
       - target (answer) image
       - generated image
     - Parse GPT’s response as a strict JSON decision:
       - `{"label": 1}` → correct
       - `{"label": 0}` → incorrect
3. Accumulate:
   - `correct_count[task]`
   - `total_count[task]`
4. Print:
   - Accuracy per task.
   - Macro-average accuracy over all tasks.

---

## 4. How GPT Is Prompted

The core idea:

- GPT sees **both** the target and your generated image.
- GPT is instructed:
  - Compare generated vs. target.
  - Decide if the generated image is *semantically correct* for the puzzle, not just visually similar.
  - Output **only** a JSON structure with `label` = `1` or `0`.
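
The judging contract described above (three images in, one strict JSON label out) can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: `to_data_url`, `build_user_message`, and `parse_label` are hypothetical names, and encoding local files as base64 data URLs is an assumption about how the images reach the API (to my knowledge the hosted API accepts http(s) or data URLs in `image_url` parts, not local file URLs).

```python
import base64
import json
import mimetypes
from pathlib import Path


def to_data_url(path):
    """Encode a local image file as a base64 data URL."""
    mime = mimetypes.guess_type(str(path))[0] or "image/png"
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def build_user_message(task_text, question_png, answer_png, generated_png):
    """One text part followed by the question, target, and generated images."""
    parts = [{"type": "text", "text": task_text}]
    for image in (question_png, answer_png, generated_png):
        parts.append({"type": "image_url", "image_url": {"url": to_data_url(image)}})
    return {"role": "user", "content": parts}


def parse_label(reply, default=0):
    """Return 1 or 0 from a strict {"label": ...} reply, else the default."""
    try:
        data = json.loads(reply)
    except (TypeError, ValueError):
        return default
    if not isinstance(data, dict):
        return default
    label = data.get("label")
    # bool is a subclass of int in Python, so reject True/False explicitly.
    if isinstance(label, bool) or label not in (0, 1):
        return default
    return int(label)
```

Keeping `default=0` means an unparseable reply counts as incorrect; passing a different default (or `None`, to skip the sample) is one way the behavior could be made configurable.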

Example system message (simplified):

```json
{
  "role": "system",
  "content": "You are an automatic judge for puzzle-like images..."
}
```

Example user message (simplified):

```json
{
  "role": "user",
  "content": [
    {"type": "text", "text": "...task description..."},
    {"type": "image_url", "image_url": {"url": "file://.../question.png"}},
    {"type": "image_url", "image_url": {"url": "file://.../answer.png"}},
    {"type": "image_url", "image_url": {"url": "file://.../generated.png"}}
  ]
}
```

The assistant must answer:

```json
{"label": 1}
```

or:

```json
{"label": 0}
```

If parsing fails, that sample is counted as incorrect by default (configurable).

---

## 5. Running the Evaluator

### 5.1. Install dependencies

Create a Python environment and install:

```bash
pip install openai pillow tqdm
```

If you use local file paths for images with the OpenAI API, ensure your environment (e.g., where the script runs) supports sending those images either as bytes or via hosted URLs. The reference implementation in `reason2gen_gpt_eval.py` uses local file reading + `input_image` uploads via the client.

### 5.2. Example command

```bash
python reason2gen_gpt_eval.py \
  --base_dir /path/to/Reason2Gen \
  --result_root /path/to/Reason2Gen_outputs \
  --method_name bagel \
  --image_suffix _bagel.png \
  --model gpt-5.1 \
  --max_samples_per_task 0
```

In this repo (copy-paste):

```bash
export OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxx"
python /mnt/bn/yuanpengtu/svgthink/benchmark/Eval/Reason2GenBench/reason2gen_gpt_eval.py \
  --base_dir /mnt/bn/yuanpengtu/svgthink/benchmark/Reason2Gen \
  --result_root /mnt/bn/yuanpengtu/svgthink/benchmark/Eval/Reason2Gen_outputs \
  --method_name bagel \
  --image_suffix _bagel.png \
  --model gpt-5.1 \
  --max_samples_per_task 0 \
  --json_mode
```

Arguments:

- `--base_dir`: root of the Reason2Gen benchmark.
- `--result_root`: root of all generated outputs.
- `--method_name`: subdirectory under each task where your edited images live.
- `--image_suffix`: suffix appended to the target filename to get your generated filename.
- `--model`: which OpenAI vision-capable model to use.
- `--max_samples_per_task`: optional cap; `0` or omitted means “all”.

You can also restrict to specific tasks:

```bash
python reason2gen_gpt_eval.py \
  --base_dir /path/to/Reason2Gen \
  --result_root /path/to/Reason2Gen_outputs \
  --method_name flux2 \
  --tasks hanoi clock pipe
```

---

## 6. Output Format

At the end, the script prints something like:

```text
===== Per-task accuracy =====
Task hanoi: 73.2% ( 293 / 400 )
Task clock: 65.0% ( 130 / 200 )
Task pipe: 70.5% ( 141 / 200 )
...
===== Overall =====
Total: 69.1% ( 564 / 816 ) across 7 tasks
```

It can also optionally write results to a JSON file:

```json
{
  "per_task": {
    "hanoi": {"correct": 293, "total": 400, "accuracy": 0.7325},
    "clock": {"correct": 130, "total": 200, "accuracy": 0.65},
    "...": {}
  },
  "overall": {
    "correct": 564,
    "total": 816,
    "accuracy": 0.691
  },
  "config": {
    "base_dir": "...",
    "result_root": "...",
    "method_name": "bagel",
    "model": "gpt-5.1"
  }
}
```

(Enable this by passing `--save_json /path/to/results.json`.)

---

## 7. Notes & Tips

- **Cost & speed**: Vision GPT calls with 3 images per sample can be expensive for large benchmarks. You can:
  - Lower `max_samples_per_task`.
  - Use a cheaper model like `gpt-4.1-mini`.
  - Cache judgments (script supports optional cache file).
- **Determinism**: Set `temperature=0` for the GPT calls to get deterministic behavior.
- **Robustness**:
  - If a sample’s images are missing, it is skipped and not counted.
  - If GPT output cannot be parsed as JSON with `label`, that sample is treated as incorrect.
- **Strict vs. lenient criteria**:
  - You can adjust the instructions to GPT to be more strict (exact final state) or more lenient (any valid solution).

---

## 8. Minimal Configuration Checklist

1. Reason2Gen benchmark present at `BASE_DIR`:
   - Contains subfolders (e.g., `hanoi`, `clock`, …).
   - Each subfolder has `<task>.json`, `question/`, `answer/`.
2. Generated images present at `RESULT_ROOT`:
   - `RESULT_ROOT/<task>/<method_name>/edited/…`.
3. Set the environment variable:

   ```bash
   export OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxx"
   ```

4. Run evaluator:

   ```bash
   python reason2gen_gpt_eval.py \
     --base_dir BASE_DIR \
     --result_root RESULT_ROOT \
     --method_name METHOD \
     --image_suffix _METHOD.png \
     --model gpt-5.1
   ```

5. Read accuracies from terminal (and JSON if saved).

---

## 9. Extending the Script

- **Different file naming scheme**:
  - Modify how output filenames are derived from `image_target`.
- **Extra context in prompts**:
  - You can inject additional text from JSON (e.g., reasoning steps) into the GPT prompt.
- **Multiple methods comparison**:
  - Run the script separately for each `method_name` and compare overall accuracy.
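
To make the file naming extension point concrete, here is a minimal sketch of how a generated-image path might be derived from a JSON entry. The names `default_namer` and `generated_path` are hypothetical, and the stem-plus-suffix rule is inferred from the `answer_0000_<suffix>.png` layout and the `--image_suffix _bagel.png` example; the actual script may derive names differently. It also assumes the JSON value is either a bare filename or a relative path ending in one.

```python
from pathlib import Path


def default_namer(target_name, image_suffix):
    """Assumed default rule: answer_0000.png + _bagel.png -> answer_0000_bagel.png."""
    return Path(target_name).stem + image_suffix


def generated_path(result_root, task, method_name, sample, image_suffix,
                   namer=default_namer):
    """Return the expected generated-image path, or None if no target key exists."""
    # The README says entries are matched by either of these two keys.
    target = sample.get("image_target") or sample.get("target_image")
    if not target:
        return None
    filename = namer(Path(target).name, image_suffix)
    return Path(result_root) / task / method_name / "edited" / filename
```

Swapping in a different naming scheme then only requires passing another `namer`, for example `namer=lambda name, suffix: "flux2_" + name`, without touching the directory logic.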


