
financebench

Source: ModelScope community (魔搭社区). Updated: 2025-11-27. Indexed: 2025-05-24.
Download link:
https://modelscope.cn/datasets/PatronusAI/financebench
Description:
[FinanceBench](https://hf.co/papers/2311.11944) is a first-of-its-kind test suite for evaluating the performance of LLMs on open-book financial question answering (QA). This is an open-source sample of 150 annotated examples used in the evaluation and analysis of the models assessed in the FinanceBench paper. The PDFs linked in the dataset are also available here: [https://github.com/patronus-ai/financebench/tree/main/pdfs](https://github.com/patronus-ai/financebench/tree/main/pdfs)

The dataset comprises questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are intended to be clear-cut and straightforward to answer, so that they serve as a minimum performance standard.

We test 16 state-of-the-art model configurations (including GPT-4-Turbo, Llama2 and Claude2, with vector stores and long-context prompts) on a sample of 150 cases from FinanceBench, and manually review their answers (n=2,400). The cases are available open-source. We find that existing LLMs have clear limitations for financial QA. All models assessed exhibit weaknesses, such as hallucinations, that limit their suitability for use by enterprises.

To evaluate your models on the full dataset, or if you have questions about this work, you can email us at contact@patronus.ai.
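The review count in the description follows from 16 model configurations times 150 cases. A minimal sketch of that arithmetic, and of how manually assigned review labels could be tallied, is shown below. The `Case` field names, the label set, and both helper functions are assumptions made for illustration; they are not taken from the dataset schema or from the paper's tooling.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Case:
    """One benchmark example (hypothetical field names)."""
    question: str
    answer: str
    evidence: str


# Hypothetical review labels a human annotator might assign to a model answer.
LABELS = ("correct", "incorrect", "refused")


def total_reviews(n_configs: int, n_cases: int) -> int:
    """Each configuration answers every case once, so reviews = configs * cases."""
    return n_configs * n_cases


def summarize(labels: list[str]) -> dict[str, float]:
    """Return the share of each label among the reviewed answers."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in LABELS}


if __name__ == "__main__":
    print(total_reviews(16, 150))
    print(summarize(["correct", "incorrect", "correct", "refused"]))
```

With 16 configurations and 150 cases, `total_reviews` returns 2400, matching the n=2,400 reported above.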

Provider:
maas
Created:
2025-05-20
Collected summary
Dataset introduction
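Background and challenges
Background overview
FinanceBench is a test suite designed to evaluate the performance of large language models on open-book financial question answering. It contains 150 annotated examples covering questions about publicly traded companies, each with an answer and an evidence string. The dataset is ecologically valid, covers a diverse set of scenarios, and is intended to serve as a minimum performance standard. Testing nevertheless shows that existing models have limitations, such as hallucinations, that reduce their suitability for enterprise use.
The content above was collected and summarized by 遇见数据集.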