FinanceRAG

Name: FinanceRAG
Creator: maas
Published: 2025-11-26 19:25:20
License: 暂无描述

魔搭社区2025-11-26 更新2025-02-15 收录

下载链接：

https://modelscope.cn/datasets/Linq-AI-Research/FinanceRAG

下载链接

链接失效反馈

官方服务：

资源简介：

# Dataset Card for FinanceRAG ## Dataset Summary The detailed description of dataset and reference will be added after the competition in [Kaggle/FinanceRAG Challenge](https://www.kaggle.com/competitions/icaif-24-finance-rag-challenge) ## Datasets ![Figure 1](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7863652%2Ffe9d7f029a218a75b06d4b866480655a%2Fimage.png?generation=1727154474415109&alt=media) 1. **Passage Retrieval**: - **FinDER**: Involves retrieving relevant sections from **10-K Reports** and financial disclosures based on **Search Queries** that simulate real-world questions asked by financial professionals, using domain-specific jargon and abbreviations. - **FinQABench**: Focuses on testing AI models' ability to answer **Search Queries** over **10-K Reports** with accuracy, evaluating the system's ability to detect hallucinations and ensure factual correctness in generated answers. - **FinanceBench**: Uses **Natural Queries** to retrieve relevant information from public filings like **10-K** and **Annual Reports**. The aim is to evaluate how well systems handle straightforward, real-world financial questions. 2. **Tabular and Text Retrieval**: - **TATQA**: Requires participants to answer **Natural Queries** that involve numerical reasoning over hybrid data, which combines tables and text from **Financial Reports**. Tasks include basic arithmetic, comparisons, and logical reasoning. - **FinQA**: Demands answering complex **Natural Queries** over **Earnings Reports** using multi-step numerical reasoning. Participants must accurately extract and calculate data from both textual and tabular sources. - **ConvFinQA**: Involves handling **Conversational Queries** where participants answer multi-turn questions based on **Earnings Reports**, maintaining context and accuracy across multiple interactions. - **MultiHiertt**: Focuses on **Multi-Hop Queries**, requiring participants to retrieve and reason over hierarchical tables and unstructured text from **Annual Reports**, making this one of the more complex reasoning tasks involving multiple steps across various document sections. ## Files For each dataset, you are provided with two files: * **corpus.jsonl** - This is a `JSONLines` file containing the context corpus. Each line in the file represents a single document in `JSON` format. * **queries.jsonl** - This is a `JSONLines` file containing the queries. Each line in this file represents one query in `JSON` format. Both files follow the jsonlines format, where each line corresponds to a separate data instance in `JSON` format. Here’s an expanded description including explanations for each line: - **_id**: A unique identifier for the context/query. - **title**: The title or headline of the context/query. - **text**: The full body of the document/query, containing the main content. ### How to Use The following code demonstrates how to load a specific subset (in this case, **FinDER**) from the **FinanceRAG** dataset on Hugging Face. In this example, we are loading the `corpus` split, which contains the document data relevant for financial analysis. The `load_dataset` function is used to retrieve the dataset, and a loop is set up to print the first document entry from the dataset, which includes fields like `_id`, `title`, and `text`. Each document provides detailed descriptions from financial reports, which participants can use for tasks such as retrieval and answering financial queries. ``` python from datasets import load_dataset # Loading a specific subset (i.e. FinDER) and a split (corpus, queries) dataset = load_dataset("Linq-AI-Research/FinanceRAG", "FinDER", split="corpus") for example in dataset: print(example) break ``` Here is an example result of `python` output of **FinDER** from the `corpus` split: ```json { '_id' : 'ADBE20230004', 'title': 'ADBE OVERVIEW', 'text': 'Adobe is a global technology company with a mission to change the world through personalized digital experiences...' } ```

# 数据集卡片：FinanceRAG ## 数据集概述数据集详情与参考文献将在本次[Kaggle/FinanceRAG挑战赛](https://www.kaggle.com/competitions/icaif-24-finance-rag-challenge)结束后补充。 ## 数据集组成 ![图1](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7863652%2Ffe9d7f029a218a75b06d4b866480655a%2Fimage.png?generation=1727154474415109&alt=media) 1. **篇章检索（Passage Retrieval）** - **FinDER**：基于模拟金融从业者真实业务问题的**搜索查询（Search Queries）**，从**10-K报告（10-K Reports）**与金融披露文件中检索相关段落，查询内容包含领域专属术语与缩写。 - **FinQABench**：聚焦测试AI模型基于**10-K报告（10-K Reports）**回答**搜索查询（Search Queries）**的准确性，评估系统检测幻觉（hallucination）并确保生成答案事实正确性的能力。 - **FinanceBench**：使用**自然查询（Natural Queries）**从**10-K（10-K）**与**年度报告（Annual Reports）**等公开备案文件中检索相关信息，旨在评估系统处理真实世界基础金融问题的性能。 2. **表格与文本检索（Tabular and Text Retrieval）** - **TATQA**：要求参与者针对混合数据（结合财务报告中的表格与文本）完成数值推理类**自然查询（Natural Queries）**的作答，任务涵盖基础算术、比较与逻辑推理。 - **FinQA**：要求参与者基于**财报（Earnings Reports）**，通过多步数值推理完成复杂**自然查询（Natural Queries）**的作答，需准确从文本与表格数据源中提取并计算数据。 - **ConvFinQA**：处理**会话查询（Conversational Queries）**任务，参与者需基于**财报（Earnings Reports）**回答多轮问题，并在多轮交互中保持上下文一致性与作答准确性。 - **MultiHiertt**：聚焦**多跳查询（Multi-Hop Queries）**，要求参与者从**年度报告（Annual Reports）**的层级化表格与非结构化文本中进行信息检索与推理，该任务属于涉及多文档段落多步骤推理的复杂任务之一。 ## 数据集文件针对每个数据集，均提供两类文件： * **corpus.jsonl**：该文件为JSONLines格式，包含上下文语料库。文件中每一行均为一条JSON格式的独立文档。 * **queries.jsonl**：该文件为JSONLines格式，包含查询语句。文件中每一行均为一条JSON格式的独立查询。两类文件均遵循jsonlines格式规范，每一行对应一个独立的JSON格式数据实例。以下为每一行数据的详细说明： - **_id**：上下文/查询的唯一标识符。 - **title**：上下文/查询的标题或主题行。 - **text**：文档/查询的完整正文，包含核心内容。 ### 使用方法以下代码演示了如何从Hugging Face平台的FinanceRAG数据集中加载指定子集（此处以FinDER为例）。本示例加载的是`corpus`划分，该划分包含用于金融分析的文档数据。我们将使用`load_dataset`函数获取数据集，并通过循环打印数据集中的首个文档条目，该条目包含`_id`、`title`与`text`等字段。每个文档均提供了财务报告的详细描述，参与者可将其用于检索与回答金融查询等任务。 python from datasets import load_dataset # 加载指定子集（即FinDER）与划分（corpus、queries） dataset = load_dataset("Linq-AI-Research/FinanceRAG", "FinDER", split="corpus") for example in dataset: print(example) break 以下为从`corpus`划分中加载的FinDER数据集的Python输出示例： json { '_id' : 'ADBE20230004', 'title': 'ADBE OVERVIEW', 'text': 'Adobe is a global technology company with a mission to change the world through personalized digital experiences...' }

提供机构：

maas

创建时间：

2025-02-12

搜集汇总

数据集介绍

以上内容由遇见数据集搜集并总结生成

5,000+

优质数据集

54 个

任务类型

进入经典数据集