Download link:

https://modelscope.cn/datasets/TIGER-Lab/StructEval

Resource description:

# StructEval: A Benchmark for Structured Output Evaluation in LLMs

StructEval is a benchmark dataset designed to evaluate the ability of large language models (LLMs) to generate and convert structured outputs across 18 formats and 44 task types. It includes both renderable types (e.g., HTML, LaTeX, SVG) and non-renderable types (e.g., JSON, XML, TOML), and supports tasks such as format generation from natural language prompts and format-to-format conversion.

## Dataset Summary

Each example in the dataset includes:

- A unique task identifier and name.
- The natural language query.
- Feature-level requirements written in English.
- Input and output format types.
- A complete input query example.
- Metrics for evaluation: raw output keywords, VQA question-answer pairs (for renderable types), and structural rendering flags.

StructEval supports multi-metric evaluation pipelines, including visual rendering checks, VQA scoring via vision-language models, and path-based key validation for structured data.

---

## Supported Task Types

- **Generation (Text → Format)**: The LLM generates code in a target structured format from a natural language prompt.
- **Conversion (Format → Format)**: The LLM converts one structured format into another, e.g., HTML to React or JSON to YAML.

---

## Dataset Structure

**Features:**

| Feature                | Type              | Description                                      |
|------------------------|-------------------|--------------------------------------------------|
| `task_id`              | `string`          | Unique identifier for the task.                  |
| `query`                | `string`          | Task query provided to the LLM.                  |
| `feature_requirements` | `string`          | English description of key requirements.         |
| `task_name`            | `string`          | Short label for the task.                        |
| `input_type`           | `string`          | Input format (e.g., `text`, `HTML`, `JSON`).     |
| `output_type`          | `string`          | Target output format.                            |
| `query_example`        | `string`          | Full query string for evaluation.                |
| `VQA`                  | `list` of dicts   | List of visual Q/A pairs (for renderable types). |
| `raw_output_metric`    | `list[string]`    | Keywords or structural tokens for evaluation.    |
| `rendering`            | `bool`            | Whether the task output is visually rendered.    |

## Usage Example

```python
from datasets import load_dataset

# "your-username/structeval" is a placeholder; replace it with the dataset's actual repository ID.
dataset = load_dataset("your-username/structeval")

# Inspect the first example of the "train" split.
example = dataset["train"][0]
print(example["query"])  # task query given to the LLM
print(example["VQA"])    # visual Q/A pairs (renderable types)
```

## Citation

Please cite us with:

```
@misc{yang2025structeval,
  title={StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs},
  author={Jialin Yang and Dongfu Jiang and Lipeng He and Sherman Siu and Yuxuan Zhang and Disen Liao and Zhuofeng Li and Huaye Zeng and Yiming Jia and Haozhe Wang and Benjamin Schneider and Chi Ruan and Wentao Ma and Zhiheng Lyu and Yifei Wang and Yi Lu and Quy Duc Do and Ziyan Jiang and Ping Nie and Wenhu Chen},
  year={2025},
  eprint={2505.20139},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  doi={10.48550/arXiv.2505.20139}
}
```
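Returning to the evaluation metrics named in the Dataset Summary: the README mentions raw output keywords and path-based key validation but does not spell out how either is computed. The sketch below is one plausible reading, not the benchmark's actual scorer. The helper names `keyword_score` and `path_exists`, the dotted-path syntax (with numeric segments for list indices), and the sample record are all assumptions made for illustration.

```python
import json
from typing import Any


def keyword_score(output: str, keywords: list[str]) -> float:
    """Fraction of expected keywords that appear verbatim in the model output."""
    if not keywords:
        return 0.0
    hits = sum(1 for kw in keywords if kw in output)
    return hits / len(keywords)


def path_exists(data: Any, path: str) -> bool:
    """Return True if a dotted path such as 'user.tags.0' resolves in parsed data."""
    node = data
    for part in path.split("."):
        if isinstance(node, dict) and part in node:
            node = node[part]
        elif isinstance(node, list) and part.isdigit() and int(part) < len(node):
            node = node[int(part)]
        else:
            return False
    return True


# Hypothetical record shaped like the feature table; the values are invented.
record = {
    "task_id": "demo-0001",
    "output_type": "JSON",
    "rendering": False,
    "raw_output_metric": ["user.name", "user.tags.0", "user.email"],
}

# Hypothetical model output for a non-renderable (JSON) task.
model_output = '{"user": {"name": "Ada", "tags": ["admin"]}}'
parsed = json.loads(model_output)

# Path-based check: which expected keys are present in the parsed structure?
resolved = [p for p in record["raw_output_metric"] if path_exists(parsed, p)]
path_score = len(resolved) / len(record["raw_output_metric"])
print(resolved, round(path_score, 2))

# Keyword check: plain substring matching against the raw output text.
print(keyword_score(model_output, ["name", "tags", "email"]))
```

Substring matching is cheap but only shows that a token occurs somewhere in the text, while the path check confirms that the key sits at the expected place in the parsed structure. That difference is one reason a benchmark might report both.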

Application scenarios: