
KIE-HVQA

ModelScope Community | Updated 2026-01-09 | Indexed 2025-09-13
Download link: https://modelscope.cn/datasets/bytedance-research/KIE-HVQA
Resource description:
# KIE-HVQA

## Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models

> Data for the paper [Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models](http://arxiv.org/abs/2506.20168)

## What's New

- **[2025/09/19]** The paper has been accepted by the NeurIPS 2025 **Main** Conference.

## Introduction

Recent advances in multimodal large language models (MLLMs) have significantly improved document understanding by integrating textual and visual information. In real-world scenarios, however, and especially under visual degradation, existing models often fall short. They struggle to perceive and handle visual ambiguities accurately, which leads to overreliance on linguistic priors and misaligned visual-textual reasoning. This difficulty in recognizing uncertainty frequently results in hallucinated content, particularly when a precise answer is infeasible.

To better illustrate and address this problem, we propose KIE-HVQA, the first dedicated benchmark designed to evaluate OCR hallucination in degraded document understanding. KIE-HVQA comprises test samples from identity cards and invoices, augmented with simulated real-world degradations that compromise OCR reliability. The benchmark assesses a model's ability to discern reliable visual information under degraded conditions and to respond appropriately, thereby emphasizing the challenge of avoiding hallucination when faced with uncertain data.

<p align="center"> <img src="figures/intro.png" width="600"/> </p>

## Main Results

We assessed several recent state-of-the-art models, both open-source and proprietary.

<p align="center"> <img src="figures/result.png" width="600"/> </p>

## Usage

### Evaluating MLLM results

- Run the following command to evaluate the MLLM results:

```bash
cd KIE-HVQA
python3 eval.py
```

### Inference demo

- Run the following command to generate predictions:

```bash
cd KIE-HVQA
python3 infer_qwen.py
```

## License

The source code is licensed under the Apache License 2.0. The dataset is licensed under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) License.

## Acknowledgement

The dataset is built upon [OCRBench](https://github.com/Yuliang-Liu/MultimodalOCR) and [WildReceipt](https://arxiv.org/abs/2103.14470).

## Citation

If you find this project useful in your research, please cite:

```bibtex
@misc{he2025seeingbelievingmitigatingocr,
  title={Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models},
  author={Zhentao He and Can Zhang and Ziheng Wu and Zhenghao Chen and Yufei Zhan and Yifan Li and Zhao Zhang and Xian Wang and Minghui Qiu},
  year={2025},
  eprint={2506.20168},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.20168},
}
```
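The idea described in the Introduction, that a model should respond appropriately rather than guess when degradation makes a field unreadable, can be illustrated with a small scoring sketch. This is not the logic of `eval.py`, which is not shown here. The `Sample` fields, the `UNREADABLE` marker, and the three outcome labels are all hypothetical names chosen for illustration, and the benchmark's actual answer format and metric may differ.

```python
from dataclasses import dataclass

# Hypothetical marker a model might emit for text it cannot read;
# the actual KIE-HVQA answer format may differ.
UNREADABLE = "[unreadable]"


@dataclass
class Sample:
    prediction: str    # model answer for one key field
    ground_truth: str  # annotated text of the field
    legible: bool      # False if degradation made the field unreadable


def score(sample: Sample) -> str:
    """Classify one prediction as 'correct', 'wrong', or 'hallucinated'."""
    pred = sample.prediction.strip()
    if not sample.legible:
        # On an unreadable field, the only acceptable answer is abstention;
        # any concrete text is counted as a hallucination.
        return "correct" if pred == UNREADABLE else "hallucinated"
    return "correct" if pred == sample.ground_truth.strip() else "wrong"


def summarize(samples: list[Sample]) -> dict[str, float]:
    """Return the fraction of samples in each outcome category."""
    counts = {"correct": 0, "wrong": 0, "hallucinated": 0}
    for s in samples:
        counts[score(s)] += 1
    total = len(samples) or 1  # avoid division by zero on empty input
    return {k: v / total for k, v in counts.items()}
```

Under these assumptions, a prediction that reproduces plausible text for an unreadable field is counted as hallucinated even if it happens to match the annotation, because the visual evidence could not have supported it.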

Provider: maas
Created: 2025-08-25