Dragongon/FinLFQA
Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/Dragongon/FinLFQA
---
license: mit
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- finance
- long-form-qa
- attribution
- financial-analysis
- LLM-evaluation
pretty_name: FinLFQA
size_categories:
- 1K<n<10K
---
# FinLFQA
[**📖 Paper**](https://aclanthology.org/anthology-files/anthology-files/pdf/findings/2025.findings-emnlp.908.pdf) | [**💻 GitHub**](https://github.com/yitaoLong/FinLFQA)
This repository hosts the dataset for the paper [FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering](https://aclanthology.org/anthology-files/anthology-files/pdf/findings/2025.findings-emnlp.908.pdf).
**FinLFQA** is a benchmark for evaluating the ability of large language models (LLMs) to generate long-form answers with fine-grained attributions in the financial domain. Unlike existing benchmarks that focus on short-form or extractive QA, FinLFQA requires models to synthesize information from multiple financial documents, apply professional financial knowledge, perform numerical reasoning, and provide clause-level attribution for every claim in their response.
<p align="center">
<img src="dataset.png" alt="FinLFQA Dataset Overview" width="800"/>
</p>
## Dataset Overview
FinLFQA contains **1,008** expert-annotated examples spanning diverse financial analysis topics. Each example requires the model to reason over financial filings from **two companies**, apply domain-specific knowledge (e.g., profitability ratios, DCF valuation, capital structure optimization), and produce a structured, multi-clause answer with fine-grained attribution.
| Split | # Samples |
|-------|-----------|
| Development | 302 |
| Test | 706 |
### Key Features
- **Cross-document reasoning**: Each question requires synthesizing information from two companies' financial filings.
- **Fine-grained attribution**: Answers are decomposed into clauses, each attributed to specific evidence paragraphs, professional knowledge, and/or computational code.
- **Numerical reasoning**: Answers involve quantitative calculations grounded in financial formulas and verifiable through executable Python code.
- **Professional knowledge grounding**: Each example includes a list of relevant financial formulas and domain knowledge used in reasoning.
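As a rough illustration of the numerical-verification idea in the list above, the sketch below compares a stored execution result with the value reported in a clause. The numbers come from the example record shown in this card; the helper name `result_matches` and the tolerance are assumptions made for illustration, not part of the dataset or its official evaluation code.

```python
import math

def result_matches(execution_result: str, reported: float, tol: float = 1e-6) -> bool:
    """Return True if a stored execution result agrees with a reported value.

    Hypothetical helper: the tolerance is an assumption, not the benchmark's rule.
    """
    return math.isclose(float(execution_result), reported, abs_tol=tol)

# Binary floating point makes 3.1 - 2.9 differ slightly from 0.2,
# so an exact equality check against the reported value would fail.
difference = 3.1 - 2.9
print(difference == 0.2)                           # False
print(result_matches("0.20000000000000018", 0.2))  # True
```

Comparing with a tolerance rather than by string equality is one plausible way to reconcile `code_execution_result` with the rounded figures in `numerical_values`.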
## Loading the Dataset
```python
from datasets import load_dataset
import json
dataset = load_dataset("Dragongon/FinLFQA")
# Access the development set
dev_set = dataset["validation"]
print(f"Development set size: {len(dev_set)}")
# Access the test set
test_set = dataset["test"]
print(f"Test set size: {len(test_set)}")
# Print the first example
example = dev_set[0]
# The `context` and `clauses` fields are JSON-encoded strings; parse them as needed:
context = json.loads(example["context"])
clauses = json.loads(example["clauses"])
print(example["question"])
print("Companies:", list(context.keys()))
print("Number of clauses:", len(clauses))
```
## Data Format
Each example in the dataset contains the following fields:
```json
{
"id": "[int] Unique identifier for the example",
"question": "[string] The financial analysis question",
"answer": "[string] Expert-written long-form answer with inline attribution markers",
"topic": "[string] The financial analysis topic category",
"clauses": "[string] JSON-encoded list of decomposed answer clauses with fine-grained attribution",
"context": "[string] JSON-encoded dict of financial document paragraphs keyed by company ticker",
"professional knowledge list": "[list] Relevant financial formulas and domain knowledge",
"numerical_values": "[list] Key numerical values involved in the answer"
}
```
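Because `clauses` and `context` are stored as JSON-encoded strings, it can be convenient to decode a record once before working with it. The following is a minimal sketch; `decode_example` and the synthetic record are hypothetical and only mirror the field names listed above.

```python
import json

# Fields stored as JSON-encoded strings, per the field list above.
JSON_FIELDS = ("clauses", "context")

def decode_example(example: dict) -> dict:
    """Return a copy of `example` with its JSON-encoded fields parsed."""
    decoded = dict(example)
    for field in JSON_FIELDS:
        value = decoded.get(field)
        if isinstance(value, str):  # leave already-parsed values untouched
            decoded[field] = json.loads(value)
    return decoded

# Synthetic record for illustration only (not taken from the dataset).
raw = {
    "id": 0,
    "question": "placeholder question",
    "clauses": json.dumps([{"cid": 0, "clause": "placeholder claim"}]),
    "context": json.dumps({"AAA": ["p0", "p1"], "BBB": ["p0"]}),
}
record = decode_example(raw)
print(sorted(record["context"]))    # ['AAA', 'BBB']
print(record["clauses"][0]["cid"])  # 0
```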
### Clause Structure
Each clause in the `clauses` field contains:
```json
{
"cid": "[int] Clause ID",
"clause": "[string] The claim text",
"inference": "[list] Indices of clauses this clause infers from",
"evidence": "[dict] Mapping from company ticker to paragraph indices used as evidence",
"professional knowledge": "[string] The financial formula or knowledge applied",
"code": "[string] Executable Python code for numerical verification",
"code_execution_result": "[string] Result of executing the code"
}
```
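To show how the `evidence` mapping relates to `context`, the sketch below looks up the paragraphs a clause cites. It assumes the evidence indices are 0-based positions in each company's paragraph list, which this card does not state explicitly; `evidence_paragraphs` and the sample data are hypothetical.

```python
def evidence_paragraphs(clause: dict, context: dict) -> dict:
    """Map each company ticker to the paragraphs a clause cites as evidence.

    Assumption: evidence indices are 0-based positions in the context lists.
    Out-of-range indices are skipped rather than raising an error.
    """
    resolved = {}
    for ticker, indices in clause.get("evidence", {}).items():
        paragraphs = context.get(ticker, [])
        resolved[ticker] = [paragraphs[i] for i in indices if 0 <= i < len(paragraphs)]
    return resolved

# Synthetic clause and context for illustration only.
context = {"AAA": ["a0", "a1", "a2"], "BBB": ["b0"]}
clause = {"cid": 1, "inference": [0], "evidence": {"AAA": [2], "BBB": []}}
print(evidence_paragraphs(clause, context))  # {'AAA': ['a2'], 'BBB': []}
```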
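### Example
For readability, the `clauses` and `context` fields are shown here after parsing with `json.loads`; in the raw dataset they are JSON-encoded strings.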
```json
{
"id": 0,
"question": "How does EBC's net interest income sensitivity compare between March 31, 2024, and December 31, 2023, when the interest rate change is +200 basis points?",
"answer": "EBC's net interest income sensitivity decreased by 0.2% {code: [0]} (2.9% - 3.1%) from December 31, 2023, to March 31, 2024. {evidence: EBC: [4], W: [], professional knowledge: [0]} ...",
"topic": "Cost of Capital Optimization Using Real Options Analysis",
"clauses": [
{
"cid": 0,
"clause": "EBC's net interest income sensitivity decreased by 0.2% (2.9% - 3.1%) from December 31, 2023, to March 31, 2024.",
"inference": [],
"evidence": {"EBC": [4], "W": []},
"professional knowledge": "Interest Rate Risk Analysis=Net Interest Margin (NIM) = (Interest Income - Interest Expense) / Average Earning Assets",
"code": "def calculate_net_interest_income_sensitivity_change(): ...",
"code_execution_result": "0.20000000000000018"
}
],
"context": {
"EBC": ["paragraph 1", "paragraph 2", "..."],
"W": ["paragraph 1", "paragraph 2", "..."]
},
"professional knowledge list": [
"Profitability Ratios=Net Profit Margin = (Net Income / Revenue) * 100",
"..."
],
"numerical_values": [0.2, 2.9, 3.1]
}
```
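The inline attribution markers in `answer` can be pulled out with regular expressions. This is only a sketch: it assumes every marker follows the `{label: [indices], ...}` shape of the single example above, and `parse_markers` and `strip_markers` are hypothetical helpers rather than part of the official code.

```python
import re

MARKER = re.compile(r"\{([^{}]*)\}")                   # one {...} marker
ENTRY = re.compile(r"([^:,\[\]{}]+):\s*\[([^\]]*)\]")  # label: [i, j, ...]

def parse_markers(answer: str) -> list:
    """Return one dict per marker, mapping each label to a list of int indices."""
    markers = []
    for body in MARKER.findall(answer):
        entries = {}
        for label, indices in ENTRY.findall(body):
            entries[label.strip()] = [int(i) for i in indices.split(",") if i.strip()]
        markers.append(entries)
    return markers

def strip_markers(answer: str) -> str:
    """Remove the markers, leaving only the answer prose."""
    return re.sub(r"\s*\{[^{}]*\}", "", answer)

# Answer text copied from the example above (truncated there with "...").
answer = (
    "EBC's net interest income sensitivity decreased by 0.2% {code: [0]} "
    "(2.9% - 3.1%) from December 31, 2023, to March 31, 2024. "
    "{evidence: EBC: [4], W: [], professional knowledge: [0]} ..."
)
print(parse_markers(answer))
# [{'code': [0]}, {'EBC': [4], 'W': [], 'professional knowledge': [0]}]
print(strip_markers(answer))
```

In this sketch the `evidence:` prefix is dropped, so any label other than `code` and `professional knowledge` would be read as a company ticker.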
## Contact
For questions or issues, please email Yitao Long ([yitao.long@nyu.edu](mailto:yitao.long@nyu.edu)).
## Citation
```bibtex
@inproceedings{long-etal-2025-finlfqa,
title = "{F}in{LFQA}: Evaluating Attributed Text Generation of {LLM}s in Financial Long-Form Question Answering",
author = "Long, Yitao and
Hu, Tiansheng and
Zhao, Yilun and
Cohan, Arman and
Zhao, Chen",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.908/",
doi = "10.18653/v1/2025.findings-emnlp.908",
pages = "16730--16750",
ISBN = "979-8-89176-335-7",
abstract = "Large Language Models (LLMs) frequently hallucinate to long-form questions, producing plausible yet factually incorrect answers. A common mitigation strategy is to provide attribution to LLM outputs. However, existing benchmarks primarily focus on simple attribution that retrieves supporting textual evidence as references. We argue that in real-world scenarios such as financial applications, attribution goes beyond reference retrieval. We introduce FinLFQA, a benchmark designed to evaluate the ability of LLMs to generate long-form answers to complex financial questions with reliable and nuanced attributions. FinLFQA evaluates three critical aspects of attribution through human annotations: (1) supporting evidence extracted from financial reports, (2) intermediate numerical reasoning steps, and (3) domain-specific financial knowledge that informs the reasoning process. We further provide an automatic evaluation framework covering both answer quality and attribution quality. Through extensive experiments on eight LLMs across multiple attribution-generation paradigms, we find that fine-grained metrics are important to distinguish model capabilities, that end-to-end generation achieves comparable performance to post-hoc approaches, and that iterative refinement only helps when guided by external feedback."
}
```
Provider: Dragongon



