
Dragongon/FinLFQA

Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Dragongon/FinLFQA
Description:
---
license: mit
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- finance
- long-form-qa
- attribution
- financial-analysis
- LLM-evaluation
pretty_name: FinLFQA
size_categories:
- 1K<n<10K
---

# FinLFQA

[**📖 Paper**](https://aclanthology.org/anthology-files/anthology-files/pdf/findings/2025.findings-emnlp.908.pdf) | [**💻 GitHub**](https://github.com/yitaoLong/FinLFQA)

The dataset for the paper [FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering](https://aclanthology.org/anthology-files/anthology-files/pdf/findings/2025.findings-emnlp.908.pdf).

**FinLFQA** is a benchmark for evaluating the ability of large language models (LLMs) to generate long-form answers with fine-grained attributions in the financial domain. Unlike existing benchmarks that focus on short-form or extractive QA, FinLFQA requires models to synthesize information from multiple financial documents, apply professional financial knowledge, perform numerical reasoning, and provide clause-level attribution for every claim in their response.

<p align="center">
  <img src="dataset.png" alt="FinLFQA Dataset Overview" width="800"/>
</p>

## Dataset Overview

FinLFQA contains **1,008** expert-annotated examples spanning diverse financial analysis topics. Each example requires the model to reason over financial filings from **two companies**, apply domain-specific knowledge (e.g., profitability ratios, DCF valuation, capital structure optimization), and produce a structured, multi-clause answer with fine-grained attribution.

| Split | # Samples |
|-------|-----------|
| Development | 302 |
| Test | 706 |

### Key Features

- **Cross-document reasoning**: Each question requires synthesizing information from two companies' financial filings.
- **Fine-grained attribution**: Answers are decomposed into clauses, each attributed to specific evidence paragraphs, professional knowledge, and/or computational code.
- **Numerical reasoning**: Answers involve quantitative calculations grounded in financial formulas and verifiable through executable Python code.
- **Professional knowledge grounding**: Each example includes a list of relevant financial formulas and domain knowledge used in reasoning.

## Loading the Dataset

```python
from datasets import load_dataset
import json

dataset = load_dataset("Dragongon/FinLFQA")

# Access the development set
dev_set = dataset["validation"]
print(f"Development set size: {len(dev_set)}")

# Access the test set
test_set = dataset["test"]
print(f"Test set size: {len(test_set)}")

# Print the first example
example = dev_set[0]

# The `context` and `clauses` fields are JSON-encoded strings; parse them as needed:
context = json.loads(example["context"])
clauses = json.loads(example["clauses"])

print(example["question"])
print("Companies:", list(context.keys()))
print("Number of clauses:", len(clauses))
```

## Data Format

Each example in the dataset contains the following fields:

```json
{
  "id": "[int] Unique identifier for the example",
  "question": "[string] The financial analysis question",
  "answer": "[string] Expert-written long-form answer with inline attribution markers",
  "topic": "[string] The financial analysis topic category",
  "clauses": "[string] JSON-encoded list of decomposed answer clauses with fine-grained attribution",
  "context": "[string] JSON-encoded dict of financial document paragraphs keyed by company ticker",
  "professional knowledge list": "[list] Relevant financial formulas and domain knowledge",
  "numerical_values": "[list] Key numerical values involved in the answer"
}
```

### Clause Structure

Each clause in the `clauses` field contains:

```json
{
  "cid": "[int] Clause ID",
  "clause": "[string] The claim text",
  "inference": "[list] Indices of clauses this clause infers from",
  "evidence": "[dict] Mapping from company ticker to paragraph indices used as evidence",
  "professional knowledge": "[string] The financial formula or knowledge applied",
  "code": "[string] Executable Python code for numerical verification",
  "code_execution_result": "[string] Result of executing the code"
}
```

### Example

```json
{
  "id": 0,
  "question": "How does EBC's net interest income sensitivity compare between March 31, 2024, and December 31, 2023, when the interest rate change is +200 basis points?",
  "answer": "EBC's net interest income sensitivity decreased by 0.2% {code: [0]} (2.9% - 3.1%) from December 31, 2023, to March 31, 2024. {evidence: EBC: [4], W: [], professional knowledge: [0]} ...",
  "topic": "Cost of Capital Optimization Using Real Options Analysis",
  "clauses": [
    {
      "cid": 0,
      "clause": "EBC's net interest income sensitivity decreased by 0.2% (2.9% - 3.1%) from December 31, 2023, to March 31, 2024.",
      "inference": [],
      "evidence": {"EBC": [4], "W": []},
      "professional knowledge": "Interest Rate Risk Analysis=Net Interest Margin (NIM) = (Interest Income - Interest Expense) / Average Earning Assets",
      "code": "def calculate_net_interest_income_sensitivity_change(): ...",
      "code_execution_result": "0.20000000000000018"
    }
  ],
  "context": {
    "EBC": ["paragraph 1", "paragraph 2", "..."],
    "W": ["paragraph 1", "paragraph 2", "..."]
  },
  "professional knowledge list": [
    "Profitability Ratios=Net Profit Margin = (Net Income / Revenue) * 100",
    "..."
  ],
  "numerical_values": [0.2, 2.9, 3.1]
}
```

## Contact

For any issues or questions, kindly email us at: Yitao Long ([yitao.long@nyu.edu](mailto:yitao.long@nyu.edu)).

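
## Working with Clause Attributions

The `evidence` and `code_execution_result` fields described under Clause Structure can be consumed roughly as follows. This is a minimal sketch, not part of the official FinLFQA tooling: the helper names `resolve_evidence` and `result_matches` are hypothetical, the paragraph texts are placeholders, and it assumes that evidence indices are 0-based positions into each company's paragraph list, which this card does not state explicitly.

```python
import json
import math


def resolve_evidence(clause, context):
    """Map each company ticker to the context paragraphs the clause cites.

    Assumption (not confirmed by the dataset card): evidence indices are
    0-based positions in context[ticker]. Out-of-range indices are skipped.
    """
    resolved = {}
    for ticker, indices in clause.get("evidence", {}).items():
        paragraphs = context.get(ticker, [])
        resolved[ticker] = [paragraphs[i] for i in indices if 0 <= i < len(paragraphs)]
    return resolved


def result_matches(code_execution_result, numerical_values, tol=1e-6):
    """Return True if the stored execution result is close to any listed value.

    The result is stored as a string, so it is converted to float first;
    the comparison uses a tolerance because of floating-point rounding.
    """
    try:
        value = float(code_execution_result)
    except (TypeError, ValueError):
        return False  # non-numeric results cannot be compared this way
    return any(
        math.isclose(value, float(v), rel_tol=tol, abs_tol=tol)
        for v in numerical_values
    )


# Toy record shaped like the example in this card; paragraph texts are placeholders.
context = json.loads('{"EBC": ["p0", "p1", "p2", "p3", "p4"], "W": ["q0"]}')
clause = {
    "cid": 0,
    "evidence": {"EBC": [4], "W": []},
    "code_execution_result": "0.20000000000000018",
}

evidence = resolve_evidence(clause, context)
matches = result_matches(clause["code_execution_result"], [0.2, 2.9, 3.1])
print(evidence)
print(matches)
```

The tolerance-based comparison is a design choice for this sketch: the stored result `"0.20000000000000018"` and the listed value `0.2` differ only by floating-point rounding, so an exact string or equality check would reject a correct clause.
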
## Citation

```bibtex
@inproceedings{long-etal-2025-finlfqa,
    title = "{F}in{LFQA}: Evaluating Attributed Text Generation of {LLM}s in Financial Long-Form Question Answering",
    author = "Long, Yitao and Hu, Tiansheng and Zhao, Yilun and Cohan, Arman and Zhao, Chen",
    editor = "Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.908/",
    doi = "10.18653/v1/2025.findings-emnlp.908",
    pages = "16730--16750",
    ISBN = "979-8-89176-335-7",
    abstract = "Large Language Models (LLMs) frequently hallucinate to long-form questions, producing plausible yet factually incorrect answers. A common mitigation strategy is to provide attribution to LLM outputs. However, existing benchmarks primarily focus on simple attribution that retrieves supporting textual evidence as references. We argue that in real-world scenarios such as financial applications, attribution goes beyond reference retrieval. We introduce FinLFQA, a benchmark designed to evaluate the ability of LLMs to generate long-form answers to complex financial questions with reliable and nuanced attributions. FinLFQA evaluates three critical aspects of attribution through human annotations: (1) supporting evidence extracted from financial reports, (2) intermediate numerical reasoning steps, and (3) domain-specific financial knowledge that informs the reasoning process. We further provide an automatic evaluation framework covering both answer quality and attribution quality. Through extensive experiments on eight LLMs across multiple attribution-generation paradigms, we find that fine-grained metrics are important to distinguish model capabilities, that end-to-end generation achieves comparable performance to post-hoc approaches, and that iterative refinement only helps when guided by external feedback."
}
```

Provided by: Dragongon