
PatronusAI/financebench

Source: Hugging Face. Updated: 2024-11-17. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/PatronusAI/financebench
Resource overview:

License: cc-by-nc-4.0

[FinanceBench](https://hf.co/papers/2311.11944) is a first-of-its-kind test suite for evaluating the performance of LLMs on open-book financial question answering (QA). This is an open-source sample of 150 annotated examples used in the evaluation and analysis of models assessed in the FinanceBench paper. The PDFs linked in the dataset are available at [https://github.com/patronus-ai/financebench/tree/main/pdfs](https://github.com/patronus-ai/financebench/tree/main/pdfs).

The dataset comprises questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are intended to be clear-cut and straightforward to answer, so that they serve as a minimum performance standard.

We test 16 state-of-the-art model configurations (including GPT-4-Turbo, Llama2 and Claude2, with vector stores and long context prompts) on a sample of 150 cases from FinanceBench, and manually review their answers (n=2,400). The cases are available open-source. We find that existing LLMs have clear limitations for financial QA. All models assessed exhibit weaknesses, such as hallucinations, that limit their suitability for use by enterprises.

To evaluate your models on the full dataset, or if you have questions about this work, you can email us at contact@patronus.ai
Provided by:
PatronusAI
Summary of original information

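Dataset overview

Dataset name

FinanceBench

Dataset description

FinanceBench is a test suite for evaluating the performance of large language models (LLMs) on open-book financial question answering (QA). This dataset contains 150 annotated examples used in the evaluation and analysis of the models assessed in the FinanceBench paper.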

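Dataset contents

  • The dataset contains questions about publicly traded companies, with corresponding answers and evidence strings.
  • The questions are designed to be ecologically valid and to cover a diverse set of scenarios, and are intended to serve as a minimum performance standard.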
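As a rough illustration of the question, answer, and evidence-string structure described above, a single example could be modeled as follows. This is only a sketch: the class name `QAExample`, its field names, and the sample record are hypothetical and are not taken from the dataset's actual schema or contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAExample:
    """One annotated example. Field names are hypothetical, not the dataset's schema."""
    company: str
    question: str
    answer: str
    evidence: str


def has_required_fields(example: QAExample) -> bool:
    """Return True when the question, answer, and evidence string are all non-empty."""
    return all(
        text.strip()
        for text in (example.question, example.answer, example.evidence)
    )


# Made-up record for illustration only; it does not come from FinanceBench.
sample = QAExample(
    company="ExampleCo",
    question="What was ExampleCo's FY2022 revenue?",
    answer="$10 million",
    evidence="Total revenue for fiscal 2022 was $10 million.",
)

print(has_required_fields(sample))
```

A check like this can help catch incomplete records before running an evaluation, since an answer without its evidence string cannot be verified against the source filing.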

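Model evaluation

  • 16 state-of-the-art model configurations were tested (including GPT-4-Turbo, Llama2, and Claude2, with vector stores and long context prompts).
  • The answers on the 150 cases were manually reviewed, for a total of 2,400 answers.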
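The total of 2,400 reviewed answers follows from 16 configurations each answering 150 cases. The sketch below reproduces that arithmetic and shows one way manual review labels could be summarized. The label names (`"correct"`, `"incorrect"`) and both helper functions are assumptions made for illustration, not the review scheme used in the paper.

```python
from collections import Counter

NUM_CONFIGURATIONS = 16  # model configurations tested
NUM_CASES = 150          # cases in the open-source sample


def total_answers(num_configurations: int, num_cases: int) -> int:
    """Each configuration answers every case once."""
    return num_configurations * num_cases


def label_shares(labels):
    """Return the fraction of answers per review label (label names are hypothetical)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


print(total_answers(NUM_CONFIGURATIONS, NUM_CASES))  # 2400
```

For example, `label_shares(["correct", "incorrect", "correct", "correct"])` gives a share of 0.75 for `"correct"` and 0.25 for `"incorrect"`.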

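Findings

  • Existing LLMs have clear limitations for financial QA. All models assessed exhibit weaknesses, such as hallucinations, that limit their suitability for use by enterprises.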

Contact

To evaluate your models on the full dataset, or if you have questions about this work, contact us at contact@patronus.ai.
