five

FinanceRAG

收藏
魔搭社区2025-11-26 更新2025-02-15 收录
下载链接:
https://modelscope.cn/datasets/Linq-AI-Research/FinanceRAG
下载链接
链接失效反馈
官方服务:
资源简介:
# Dataset Card for FinanceRAG ## Dataset Summary The detailed description of dataset and reference will be added after the competition in [Kaggle/FinanceRAG Challenge](https://www.kaggle.com/competitions/icaif-24-finance-rag-challenge) ## Datasets ![Figure 1](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7863652%2Ffe9d7f029a218a75b06d4b866480655a%2Fimage.png?generation=1727154474415109&alt=media) 1. **Passage Retrieval**: - **FinDER**: Involves retrieving relevant sections from **10-K Reports** and financial disclosures based on **Search Queries** that simulate real-world questions asked by financial professionals, using domain-specific jargon and abbreviations. - **FinQABench**: Focuses on testing AI models' ability to answer **Search Queries** over **10-K Reports** with accuracy, evaluating the system's ability to detect hallucinations and ensure factual correctness in generated answers. - **FinanceBench**: Uses **Natural Queries** to retrieve relevant information from public filings like **10-K** and **Annual Reports**. The aim is to evaluate how well systems handle straightforward, real-world financial questions. 2. **Tabular and Text Retrieval**: - **TATQA**: Requires participants to answer **Natural Queries** that involve numerical reasoning over hybrid data, which combines tables and text from **Financial Reports**. Tasks include basic arithmetic, comparisons, and logical reasoning. - **FinQA**: Demands answering complex **Natural Queries** over **Earnings Reports** using multi-step numerical reasoning. Participants must accurately extract and calculate data from both textual and tabular sources. - **ConvFinQA**: Involves handling **Conversational Queries** where participants answer multi-turn questions based on **Earnings Reports**, maintaining context and accuracy across multiple interactions. - **MultiHiertt**: Focuses on **Multi-Hop Queries**, requiring participants to retrieve and reason over hierarchical tables and unstructured text from **Annual Reports**, making this one of the more complex reasoning tasks involving multiple steps across various document sections. ## Files For each dataset, you are provided with two files: * **corpus.jsonl** - This is a `JSONLines` file containing the context corpus. Each line in the file represents a single document in `JSON` format. * **queries.jsonl** - This is a `JSONLines` file containing the queries. Each line in this file represents one query in `JSON` format. Both files follow the jsonlines format, where each line corresponds to a separate data instance in `JSON` format. Here’s an expanded description including explanations for each line: - **_id**: A unique identifier for the context/query. - **title**: The title or headline of the context/query. - **text**: The full body of the document/query, containing the main content. ### How to Use The following code demonstrates how to load a specific subset (in this case, **FinDER**) from the **FinanceRAG** dataset on Hugging Face. In this example, we are loading the `corpus` split, which contains the document data relevant for financial analysis. The `load_dataset` function is used to retrieve the dataset, and a loop is set up to print the first document entry from the dataset, which includes fields like `_id`, `title`, and `text`. Each document provides detailed descriptions from financial reports, which participants can use for tasks such as retrieval and answering financial queries. ``` python from datasets import load_dataset # Loading a specific subset (i.e. FinDER) and a split (corpus, queries) dataset = load_dataset("Linq-AI-Research/FinanceRAG", "FinDER", split="corpus") for example in dataset: print(example) break ``` Here is an example result of `python` output of **FinDER** from the `corpus` split: ```json { '_id' : 'ADBE20230004', 'title': 'ADBE OVERVIEW', 'text': 'Adobe is a global technology company with a mission to change the world through personalized digital experiences...' } ```

# 数据集卡片:FinanceRAG ## 数据集概述 数据集详情与参考文献将在本次[Kaggle/FinanceRAG挑战赛](https://www.kaggle.com/competitions/icaif-24-finance-rag-challenge)结束后补充。 ## 数据集组成 ![图1](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7863652%2Ffe9d7f029a218a75b06d4b866480655a%2Fimage.png?generation=1727154474415109&alt=media) 1. **篇章检索(Passage Retrieval)** - **FinDER**:基于模拟金融从业者真实业务问题的**搜索查询(Search Queries)**,从**10-K报告(10-K Reports)**与金融披露文件中检索相关段落,查询内容包含领域专属术语与缩写。 - **FinQABench**:聚焦测试AI模型基于**10-K报告(10-K Reports)**回答**搜索查询(Search Queries)**的准确性,评估系统检测幻觉(hallucination)并确保生成答案事实正确性的能力。 - **FinanceBench**:使用**自然查询(Natural Queries)**从**10-K(10-K)**与**年度报告(Annual Reports)**等公开备案文件中检索相关信息,旨在评估系统处理真实世界基础金融问题的性能。 2. **表格与文本检索(Tabular and Text Retrieval)** - **TATQA**:要求参与者针对混合数据(结合财务报告中的表格与文本)完成数值推理类**自然查询(Natural Queries)**的作答,任务涵盖基础算术、比较与逻辑推理。 - **FinQA**:要求参与者基于**财报(Earnings Reports)**,通过多步数值推理完成复杂**自然查询(Natural Queries)**的作答,需准确从文本与表格数据源中提取并计算数据。 - **ConvFinQA**:处理**会话查询(Conversational Queries)**任务,参与者需基于**财报(Earnings Reports)**回答多轮问题,并在多轮交互中保持上下文一致性与作答准确性。 - **MultiHiertt**:聚焦**多跳查询(Multi-Hop Queries)**,要求参与者从**年度报告(Annual Reports)**的层级化表格与非结构化文本中进行信息检索与推理,该任务属于涉及多文档段落多步骤推理的复杂任务之一。 ## 数据集文件 针对每个数据集,均提供两类文件: * **corpus.jsonl**:该文件为JSONLines格式,包含上下文语料库。文件中每一行均为一条JSON格式的独立文档。 * **queries.jsonl**:该文件为JSONLines格式,包含查询语句。文件中每一行均为一条JSON格式的独立查询。 两类文件均遵循jsonlines格式规范,每一行对应一个独立的JSON格式数据实例。 以下为每一行数据的详细说明: - **_id**:上下文/查询的唯一标识符。 - **title**:上下文/查询的标题或主题行。 - **text**:文档/查询的完整正文,包含核心内容。 ### 使用方法 以下代码演示了如何从Hugging Face平台的FinanceRAG数据集中加载指定子集(此处以FinDER为例)。本示例加载的是`corpus`划分,该划分包含用于金融分析的文档数据。 我们将使用`load_dataset`函数获取数据集,并通过循环打印数据集中的首个文档条目,该条目包含`_id`、`title`与`text`等字段。 每个文档均提供了财务报告的详细描述,参与者可将其用于检索与回答金融查询等任务。 python from datasets import load_dataset # 加载指定子集(即FinDER)与划分(corpus、queries) dataset = load_dataset("Linq-AI-Research/FinanceRAG", "FinDER", split="corpus") for example in dataset: print(example) break 以下为从`corpus`划分中加载的FinDER数据集的Python输出示例: json { '_id' : 'ADBE20230004', 'title': 'ADBE OVERVIEW', 'text': 'Adobe is a global technology company with a mission to change the world through personalized digital experiences...' }
提供机构:
maas
创建时间:
2025-02-12
搜集汇总
数据集介绍
main_image_url
以上内容由遇见数据集搜集并总结生成
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作