soumyaBharadwaj/ErrorBench
收藏Hugging Face2026-03-28 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/soumyaBharadwaj/ErrorBench
下载链接
链接失效反馈官方服务:
资源简介:
---
language: en
license: mit
pretty_name: ErrorBench
size_categories:
- 1K<n<10K
task_categories:
- text-generation
- text-classification
task_ids:
- text2text-generation
- rdf-to-text
- multi-class-classification
paperswithcode_id: errorbench
dataset_info:
citation: |
@inproceedings{bharadwaj2026errorbench,
title={ErrorBench: Fine-Grained Error Analysis of Multi-Family LLMs in Data-to-Text Generation},
author={Soumya Bharadwaj and Ashish Anand},
booktitle={International Joint Conference on Neural Networks (IJCNN)},
year={2026}
}
tags:
- data-to-text
- llm-evaluation
- Error-Analysis
- Hallucination
- Faithfullness
- DBpedia
- Text-Generation
---
# ErrorBench: Fine-Grained Error Analysis of Multi-Family LLMs in Data-to-Text Generation
## Dataset Summary
ErrorBench is a human-annotated, span-level benchmark for analyzing generation errors in Large Language Models (LLMs) for Data-to-Text (D2T) generation. The dataset consists of sentences generated from structured DBpedia triples and annotated with fine-grained span-level error labels across 10 error categories.
The dataset was introduced in our IJCNN 2026 paper, **"ErrorBench: Fine-Grained Error Analysis of Multi-Family LLMs in Data-to-Text Generation."** It is designed to support detailed analysis of generation failures such as hallucination, omission, prompt leakage, incoherence, and entity or relation errors, which are not captured by traditional surface-level metrics like BLEU or ROUGE.
Each input tuple is paired with outputs from multiple LLMs, enabling cross-model comparative error analysis, meta-evaluation, and the development of automatic error detection systems for LLM-generated text.
ErrorBench provides a reusable span-annotated benchmark for studying reliability, faithfulness, and error behavior across LLM families and model scales in structured data-to-text generation.
## Example of Span-Level Error Annotation

*Figure: Example of span-level error annotation showing a Llama2-7B output with multiple simultaneous errors: Partial Entity Mismatch, Relation Ambiguity, Addition, and Spell/Format errors, alongside the grounded input tuple.*
---
## Dataset Statistics
* Total tuples: **224**
* Total generated sentences: **6,048**
* Error-annotated sentences: **≈2,557**
* Total annotated error spans: **≈4,732**
* LLM families evaluated: **9**
* Total model variants: **27**
* Annotation type: **Manual span-level annotation**
---
## LLM Families and Model Scales

*Figure: Overview of model families and parameter scales considered in our comparative error analysis. For each family, colors transition from light to dark, representing lower to higher parameter models.*
---
## Task Description
The task is **Data-to-Text Generation** from structured tuples of the form:
(Entity1, Entity1 Type, Relation, Entity2, Entity2 Type)
Models generate a sentence describing the relation between the entities. The generated sentence is then manually annotated for span-level errors.
---
## Dataset Structure
Each instance in the dataset contains the following fields:
| Field | Description |
| ----------------- | -------------------------------------- |
| id | Global unique numeric ID for the instance|
| uid | Original model-specific instance ID |
| model | Model name that generated the sentence |
| sentence | Model generated sentence |
| tuple.E1 | Entity 1 |
| tuple.E1_TYPE | Entity 1 type |
| tuple.RELATION | Relation |
| tuple.E2 | Entity 2 |
| tuple.E2_TYPE | Entity 2 type |
| errors.label | Error span label |
| errors.spans | Character span indices |
| errors.text | Text span containing the error |
| errors.error_type | Error category |
**Note:**
Since the dataset combines outputs from 27 different models, the original tuple IDs repeat across models.
Therefore:
- `id` = global unique dataset ID
- `uid` = original model-specific instance identifier (ModelName_TupleID)
---
## Example Instance
```json
{
"id": "17",
"model": "DeepSeekr1_1.5b",
"sentence": "Ray Mendoza was a trainer for Villano IV, who led him against Spanish colonial rule. Moving towards Spain, he played a crucial role in battling correctly with other soldiers of his time.",
"tuple": {
"E1": "Villano IV",
"E1_TYPE": "Person",
"RELATION": "trainer",
"E2": "Ray Mendoza",
"E2_TYPE": "Person"
},
"errors": [
{
"label": "ErrorSpan",
"spans": [52, 94],
"text": "who led him against Spanish colonial rule.",
"error_type": "Addition"
},
{
"label": "ErrorSpan",
"spans": [95, 115],
"text": "Moving towards Spain",
"error_type": "Addition"
}
],
"uid": "DeepSeekr1_1.5b_178"
}
```
---
## Instance Identification
Each tuple appears once for each model. Therefore, the dataset contains multiple entries corresponding to the same input tuple but generated by different models.
To avoid ID conflicts:
- `id` is a globally unique identifier for each dataset entry.
- `uid` identifies the original tuple and model combination in the format:
---
## Error Taxonomy (10 Categories)
| Error Type | Description |
| ----------------------- | ---------------------------------------- |
| Entity Omission | Required entity missing from sentence |
| Relation Omission | Relation not expressed |
| Addition | Extra information not present in tuple |
| Repetition | Repeated tokens or phrases |
| Spelling/Format Drift | Formatting or spelling issues |
| Prompt Echo | Prompt or reasoning leakage |
| Relation Ambiguity | Relation expressed unclearly |
| Entity Type Change | Entity type incorrectly expressed |
| Incoherence | Sentence is meaningless or contradictory |
| Partial Entity Mismatch | Entity partially incorrect |
---
## Annotation Process
All generated sentences were manually annotated using span-level annotation.
Annotation procedure:
1. The minimal erroneous span was identified.
2. The span was assigned one of the 10 error categories.
3. Missing entities or relations were annotated using special tags:
* [MISSING_E1]
* [MISSING_E2]
* [MISSING_RELATION]
Annotation was performed using the BRAT annotation tool by an expert annotator following strict guidelines to ensure consistency.
---
## Evaluation Metrics
The dataset supports evaluation using two metrics:
**Total Error Span Rate (TESR):**
Average number of error spans per sentence.
**Generation Quality Index (GQI):**
Percentage of completely error-free sentences.
TESR measures error density, while GQI measures overall generation success rate.
---
## Intended Use
ErrorBench can be used for:
* Evaluating Data-to-Text generation systems
* Fine-grained error analysis of LLMs
* Hallucination detection
* Faithfulness evaluation
* Training automatic error detection models
* Studying scaling effects in LLM generation
* Prompt engineering research
* Benchmarking structured text generation systems
---
## Dataset Creation Pipeline
1. Structured triples were collected from DBpedia.
2. Sentences were generated using 27 LLM variants from 9 model families.
3. Each model generated one sentence per tuple.
4. Generated sentences were manually annotated.
5. Span-level error labels were assigned using a 10-category taxonomy.
---
## Citation
If you use this dataset, please cite:
```
@inproceedings{bharadwaj2026errorbench,
title={ErrorBench: Fine-Grained Error Analysis of Multi-Family LLMs in Data-to-Text Generation},
author={Bharadwaj, Soumya and Anand, Ashish},
booktitle={International Joint Conference on Neural Networks (IJCNN)},
year={2026},
organisation={IEEE}
}
```
---
## License
This dataset is released under the MIT License.
---
## Contact
Soumya Bharadwaj
Indian Institute of Technology Guwahati
India
---
## Tags
Data-to-Text, LLM Evaluation, Error Analysis, Hallucination, Faithfulness, Benchmark Dataset, DBpedia, Text Generation
提供机构:
soumyaBharadwaj
搜集汇总
数据集介绍

构建方式
在数据到文本生成领域,ErrorBench的构建体现了严谨的学术流程。该数据集从DBpedia中提取了224个结构化三元组作为基础输入。随后,研究团队选取了涵盖9个不同家族的27个大语言模型变体,针对每个三元组生成对应的描述性句子,共计获得6048个生成结果。为确保标注质量,专家标注员采用BRAT工具对约2557个句子进行了精细的跨度级人工标注,依据一套包含10个类别的错误分类法,识别并标记了约4732个错误跨度,从而构建了一个支持细粒度错误分析的高质量基准。
特点
ErrorBench的核心特征在于其精细的跨度级错误标注体系。与依赖表层指标的传统基准不同,该数据集对生成文本中的错误进行了字符级别的定位与分类,涵盖了实体遗漏、关系模糊、信息添加、不连贯等十种具体错误类型。数据集囊括了多个模型家族及不同参数规模的生成结果,为跨模型比较和错误模式研究提供了丰富素材。其标注不仅包含错误文本片段,还通过特殊标签处理了缺失信息,这种设计使得对模型生成内容的忠实度、可靠性及具体错误行为的分析达到了前所未有的细致程度。
使用方法
该数据集为数据到文本生成及相关研究提供了多维度的评估框架。研究者可利用其进行大语言模型的细粒度错误剖析,深入探究幻觉、忠实度等特定问题。数据集支持计算总错误跨度率(TESR)和生成质量指数(GQI)等量化指标,以评估模型的错误密度与整体生成成功率。此外,标注的跨度级错误数据可用于训练自动错误检测模型,或作为元评估基准,检验其他自动评估指标的效力。在具体使用中,用户可通过数据集中提供的唯一标识符(id, uid)定位特定实例,结合输入三元组、模型输出及对应的错误标注信息,开展系统性分析。
背景与挑战
背景概述
ErrorBench数据集由Soumya Bharadwaj和Ashish Anand于2026年提出,并在国际神经网络联合会议(IJCNN)上正式发布。该数据集聚焦于数据到文本生成领域,旨在对大型语言模型在结构化数据转换过程中的生成错误进行细粒度分析。其核心研究问题在于突破传统表面级评估指标(如BLEU或ROUGE)的局限,通过人工标注的跨度级错误标签,深入揭示模型在忠实性、连贯性等方面的具体缺陷。该数据集的构建基于DBpedia的三元组,涵盖了9个模型家族的27个变体,为比较不同模型在实体遗漏、关系模糊、幻觉添加等十类错误上的表现提供了标准化基准,显著推动了生成文本可靠性评估的研究进程。
当前挑战
在数据到文本生成领域,传统评估方法难以捕捉模型输出中存在的语义不忠实、逻辑不一致等深层错误,这构成了该领域的主要挑战。ErrorBench通过定义十类细粒度错误类型,如实体遗漏、关系模糊、添加幻觉等,旨在系统性地解决对生成文本进行精确、可解释错误分析的难题。在数据集构建过程中,挑战主要体现在确保跨度级人工标注的一致性与准确性,需要专家标注者严格遵循指南,在大量模型生成的句子中识别并分类最小错误跨度,同时处理如实体缺失等特殊情况的标注,这一过程耗时且对标注质量要求极高。
常用场景
经典使用场景
在数据到文本生成领域,ErrorBench数据集为研究者提供了精细化的错误分析框架。该数据集通过人工标注的跨度级错误标签,能够系统性地评估不同大语言模型在结构化数据转换过程中的生成质量。经典应用场景包括跨模型比较分析,研究者可以基于统一的错误分类体系,量化比较GPT、Llama、Gemini等九大家族共27个模型变体在实体遗漏、关系模糊、幻觉添加等十类错误上的表现差异,从而揭示不同架构与规模模型的可靠性特征。
解决学术问题
该数据集有效解决了传统自动评价指标难以捕捉深层语义错误的问题。传统BLEU、ROUGE等表面相似度度量无法识别事实性错误与逻辑矛盾,ErrorBench通过建立细粒度错误分类体系,为数据到文本生成领域的忠实性评估提供了标准化解决方案。其意义在于推动了生成质量评估从粗粒度转向细粒度分析,使研究者能够精确诊断模型在实体关系表达、信息完整性、逻辑连贯性等方面的系统性缺陷,为构建可信赖的文本生成系统奠定了评估基础。
衍生相关工作
该数据集催生了多个方向的重要研究工作。在评估方法层面,研究者基于其标注体系开发了新型忠实性度量指标,如生成质量指数与错误跨度率,这些指标已被纳入后续基准测试的标准评估协议。在模型改进方面,相关研究利用ErrorBench的错误模式分析结果,指导了检索增强生成系统的架构设计,显著降低了实体混淆与关系歧义的发生频率。此外,该数据集还促进了跨任务迁移研究,其错误分类框架被适配应用于对话生成、代码生成等领域的可靠性评估体系构建。
以上内容由遇见数据集搜集并总结生成



