nace-ai/vgse-32b-coverage-eval-files

Name: nace-ai/vgse-32b-coverage-eval-files
Creator: nace-ai
Published: 2026-02-20 14:39:39
License: not specified

Source: Hugging Face · Updated 2026-02-20 · Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/nace-ai/vgse-32b-coverage-eval-files

Description:

# VGSE-32B Coverage Evaluation Dataset

This dataset contains traces, document images, and evaluation results for the **VGSE-32B** (Visual Grounded Structured Extraction) model coverage experiments.

---

## Directory layout

```
hf_dataset/
├── data/data_32b/              # Raw agent traces (JSONL)
├── images/images_vgse_32b/     # Document page images referenced by traces
└── results/results_32b/        # Evaluation results (JSON)
```

---

## data/data_32b/

Three JSONL files, one per session. Each line is one agent trace.

| File | Session |
|------|---------|
| `vgse_traces_ses_0_019c584f-9117-7680-b658-4a812e69285a.jsonl` | Session 0 |
| `vgse_traces_ses_1_019c60a9-2913-77f3-b810-47b2d6abe871.jsonl` | Session 1 |
| `vgse_traces_ses_2_019c6019-df0d-7371-bbdb-7bcfe7f84bbf.jsonl` | Session 2 |

### Trace record schema

```jsonc
{
  "session_id": "string",
  "trace_id": "string",
  "trace_input": "...",   // raw input to the agent
  "trace_output": "...",  // raw output from the agent
  "e2e_eval": { ... },    // end-to-end eval metadata
  "qa_agent_pydantic": [
    {
      "observation_id": "string",
      "answer_question_using_vgqa": [
        {
          "observation_id": "string",
          "input": { "query": "string" },
          "output": { "extracted_fields_with_inline_bboxes": "string" },
          "qwen_vlm": [
            {
              "observation_id": "string",
              "model": "Qwen/Qwen3-VL-32B-Instruct",
              "images": ["path/to/page.png", ...],
              "input": [ { "page_number": 0, "grounding_query": "..." }, ... ],
              "output": [ { "page_number": 0, "grounding_query": "...", ... }, ... ]
            }
          ],
          "vgse_vlm": [
            {
              "observation_id": "string",
              "model": "nace-ai/VGSE-32B",
              "images": ["path/to/page.png", ...],
              "input": { "grounding_query": "..." },
              "output": { ... }  // structured JSON extraction with bbox fields
            }
          ]
        }
      ]
    }
  ]
}
```

---

## images/images_vgse_32b/

Document page images grouped by session:

```
images_vgse_32b/
├── ses_0_019c584f-.../   (~2 904 PNG files)
├── ses_1_019c60a9-.../   (~2 810 PNG files)
└── ses_2_019c6019-.../   (~2 919 PNG files)
```

Image paths stored in traces are relative to the workspace root and resolve to files under this directory.

---

## Results summary

All results use **N ≈ 2 400** queries from three 32B sessions. Judge scores are 1–10; pass threshold is ≥ 7.

| File | N | Query preservation pass | Answer comparison score | Answer comparison pass |
|---|---|---|---|---|
| eval_1_coverage_queries_results.json | 2 399 | 97.3 % | 8.58 | 82.6 % |
| eval_2_coverage_query_rewrite_results_text_only.json | 2 403 | 99.9 % | 7.61 | 73.4 % |
| eval_3-1_coverage_results_32b_no_grounding.json | 2 403 | n/a | 8.97 | 87.2 % |
| eval_3-1_coverage_results_32b_with_grounding.json | 2 207 | n/a | 8.23 | 77.5 % |
| eval_3-2_coverage_results_vgse_schema.json | 2 373 | n/a | 8.13 | 75.6 % |
| eval_4_localization_q_loc_results.json | 2 384 | n/a | 9.35 | 94.7 % |

---

## results/results_32b/

### eval_1_coverage_queries_results.json

**Script:** `eval_coverage_queries.py`

**N:** 2 399 entries

**Results:** query preservation pass 97.3 % (mean score 9.44) · answer comparison pass 82.6 % (mean score 8.58)

**Pipeline:** Qwen-32B generates grounding queries from existing traces → judge evaluates Q vs Q' preservation → Qwen-32B answers original query directly (control A) → judge compares control A vs VGSE extracted answer A'.

```jsonc
{
  "session_id": "string",
  "trace_id": "string",
  "vgqa_observation_id": "string",
  "query": "string",                 // original user query Q
  "grounding_queries": [
    { "page_number": 0, "grounding_query": "..." }
  ],
  "query_preservation": {            // GPT-5.2 judge: does Q' cover Q?
    "score": "1-10",
    "pass": "bool",
    "reasoning": "string"
  },
  "control_answer": "string",        // Qwen-32B direct answer A
  "experiment_answer": "string",     // VGSE extracted answer A'
  "answer_comparison": {             // GPT-5.2 judge: A vs A'
    "score": "1-10",
    "pass": "bool",
    "reasoning": "string"
  }
}
```

---

### eval_2_coverage_query_rewrite_results_text_only.json

**Script:** `eval_coverage_with_query_rewrite.py` (`GROUNDING_PLANNER_WITH_IMAGE=False`)

**N:** 2 403 entries

**Results:** query preservation pass 99.9 % (mean score 9.88) · answer comparison pass 73.4 % (mean score 7.61)

**Pipeline:** GPT-5.2 rewrites query into a single grounding query (text-only, no images) → VGSE-32B runs the same query on every page concurrently → per-page outputs are merged into one JSON → judge compares control A (Qwen-32B) vs merged VGSE answer.

```jsonc
{
  "session_id": "string",
  "trace_id": "string",
  "vgqa_observation_id": "string",
  "query": "string",
  "grounding_queries": [
    { "page_number": 0, "grounding_query": "..." }
  ],
  "query_preservation": {            // GPT-5.2 judge: Q vs GPT-rewritten Q'
    "score": "1-10",
    "pass": "bool",
    "reasoning": "string"
  },
  "control_answer": "string",
  "experiment_vgse_outputs": [       // raw per-page VGSE output before merge
    { "page_number": 0, "grounding_query": "...", "vgse_output": "string" }
  ],
  "experiment_answer": "string",     // merged VGSE output (JSON string)
  "answer_comparison": {
    "score": "1-10",
    "pass": "bool",
    "reasoning": "string"
  }
}
```

---

### eval_3-1_coverage_results_32b_no_grounding.json

**Script:** `eval_coverage.py` (no inline bbox grounding in VQA prompt)

**N:** 2 403 entries

**Results:** answer comparison pass 87.2 % (mean score 8.97)

**Pipeline:** Qwen-32B answers with plain VQA prompt (no grounding instruction) → VGSE-32B answers with same prompt → judge + grounding overlap metrics.

```jsonc
{
  "session_id": "string",
  "trace_id": "string",
  "vgqa_observation_id": "string",
  "query": "string",
  "grounding_queries": [
    { "page_number": 0, "grounding_query": "..." }
  ],
  "control_answer": "string",
  "experiment_answer": "string",
  "answer_comparison": {
    "score": "1-10",
    "pass": "bool",
    "reasoning": "string"
  },
  "grounding_metrics": {
    "control_grounded_value_count": 0,
    "experiment_grounded_value_count": 0,
    "control_bbox_pattern_match_count": 0,      // inline bbox citations in control answer
    "experiment_bbox_pattern_match_count": 0,   // inline bbox citations in experiment answer
    "control_bbox_link_count": 0,
    "experiment_bbox_link_count": 0,
    "control_crop_accuracy": 0.0,               // containment accuracy: cited text inside crop
    "experiment_crop_accuracy": 0.0,
    "same_extracted_values_count": 0,
    "overlapping_quotes_count": 0,              // quoted text shared between control & experiment
    "same_extracted_values": [],
    "iou_pair_count": 0,
    "iou_matched_values_count": 0,
    "iou_mean": 0.0,
    "iou_values": [],
    "iou_match_rate_at_0_25": 0.0,
    "control_crop_checked": 0,
    "control_crop_correct": 0,
    "experiment_crop_checked": 0,
    "experiment_crop_correct": 0
  }
}
```

---

### eval_3-1_coverage_results_32b_with_grounding.json

**Script:** `eval_coverage.py` (with inline bbox grounding prompt)

**N:** 2 207 entries

**Results:** answer comparison pass 77.5 % (mean score 8.23)

**Pipeline:** Same as the no-grounding variant, but both models are prompted to add inline `[text](bbox://file#bbox=x0,y0,x1,y1)` citations to every extracted value.

The schema is identical to that of `eval_3-1_coverage_results_32b_no_grounding.json`, with all `grounding_metrics` fields populated.

---

### eval_3-2_coverage_results_vgse_schema.json

**Script:** `eval_coverage_vgse_schema.py`

**N:** 2 373 entries

**Results:** answer comparison pass 75.6 % (mean score 8.13) · IoU mean 0.145

**Pipeline:** VGSE-32B is evaluated using its native structured-extraction schema prompt (`VGSE_SYSTEM_PROMPT_EXPERIMENTAL`, system + user message split) against Qwen-32B control answers that use `CONTROL_VQA_PROMPT_WITH_GROUNDING`. Overlap metrics compare Qwen's inline bbox citations against VGSE's `raw_text` fields.

```jsonc
{
  "session_id": "string",
  "trace_id": "string",
  "vgqa_observation_id": "string",
  "query": "string",
  "control_answer": "string",        // Qwen-32B with inline bbox citations
  "experiment_answer": "string",     // VGSE-32B structured JSON output
  "answer_comparison": {
    "score": "1-10",
    "pass": "bool",
    "reasoning": "string"
  },
  "grounding_metrics": {
    "control_bbox_pattern_match_count": 0,      // inline bbox links in Qwen answer
    "experiment_bbox_pattern_match_count": 0,
    "control_crop_checked": 0,
    "control_crop_correct": 0,
    "control_crop_accuracy": 0.0,               // containment accuracy for control citations
    "experiment_crop_checked": 0,
    "experiment_crop_correct": 0,
    "experiment_crop_accuracy": 0.0,            // containment accuracy for VGSE raw_text/bbox
    "overlapping_quotes_count": 0,              // quoted text matching between both answers
    "overlapping_quotes": [],
    "iou_values": [],                           // IoU per matched bbox pair
    "iou_mean": 0.0                             // mean IoU across all pairs (0.145 overall)
  }
}
```

---

### eval_4_localization_q_loc_results.json

**Script:** `eval_localization_q_loc.py`

**N:** 2 384 entries (2 without a localization query)

**Results:** answer comparison pass 94.7 % (mean score 9.35) · bbox containment accuracy 78.7 %

**Pipeline:** Qwen-32B answers query directly (control A) → Qwen-32B converts A into a typed localization schema Q_loc → VGSE-32B fills in `raw_text`/`bbox`/`page_number` for each field → GPT-5.2 `judge_facts_only` compares A vs filled schema on factual equivalence and Q_loc consistency with Q → containment check verifies each `(raw_text, bbox)` pair in image crops.

```jsonc
{
  "session_id": "string",
  "trace_id": "string",
  "vgqa_observation_id": "string",
  "query": "string",
  "control_answer": "string",
  "localization_query": "string | null",       // generated localization query (null if no groundable fields)
  "localization_schema": { "...": "..." },     // JSON schema with pre-filled values (null if not generated)
  "localization_query_reason": "string",       // why the schema was or was not generated
  "experiment_answer": "string",               // VGSE-32B filled schema (JSON string)
  "answer_comparison": {                       // GPT-5.2 fact-only judge
    "score": "1-10",
    "pass": "bool",
    "reasoning": "string"
  },
  "containment_metrics": {
    "experiment_field_evidence_count": 0,      // fields with non-null raw_text+bbox
    "experiment_bbox_checked": 0,
    "experiment_bbox_contained": 0,
    "experiment_bbox_containment_accuracy": 0.0
  }
}
```

---

## Judge model

All automated evaluation steps use **GPT-5.2** via the OpenAI API. The shared judge prompt (`ANSWER_COMPARISON_PROMPT` in `judge.py`) scores semantic equivalence of two answers on a 1–10 scale; pass threshold is ≥ 7.
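
The mean scores and pass rates reported for each results file can in principle be recomputed from the per-entry judge blocks. The following is a minimal sketch, not the dataset's own evaluation code. It assumes the entries of a results file have already been loaded into a list of dicts shaped like the schemas in this README, and that `score` is stored as a number or numeric string (the schemas only show the placeholder `"1-10"`). `summarize_judge` and the `sample` entries are hypothetical and exist only for illustration.

```python
from statistics import mean

# Stated in the README: judge scores are 1-10 and the pass threshold is >= 7.
PASS_THRESHOLD = 7


def summarize_judge(entries, key="answer_comparison"):
    """Return (n, mean_score, pass_rate) for one judge block across entries.

    Hypothetical helper. Entries that lack the block, or whose score cannot
    be read as a number, are skipped rather than counted as failures.
    """
    scores = []
    for entry in entries:
        block = entry.get(key) or {}
        try:
            scores.append(float(block["score"]))
        except (KeyError, TypeError, ValueError):
            continue
    if not scores:
        return 0, None, None
    passed = sum(1 for s in scores if s >= PASS_THRESHOLD)
    return len(scores), mean(scores), passed / len(scores)


# Made-up entries for illustration only; these are not taken from the dataset.
sample = [
    {"answer_comparison": {"score": 9, "pass": True, "reasoning": "..."}},
    {"answer_comparison": {"score": 6, "pass": False, "reasoning": "..."}},
    {"answer_comparison": {"score": 7, "pass": True, "reasoning": "..."}},
    {"query": "entry without a judge block"},
]

n, avg, rate = summarize_judge(sample)
print(n, round(avg, 2), round(rate, 3))  # prints: 3 7.33 0.667
```

Passing `key="query_preservation"` would summarize the query-preservation judge in the same way for the files that include that block. The sketch recomputes the pass decision from `score` instead of reading the stored `pass` field, so the two can be cross-checked.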

Provided by:

nace-ai
