sujet-ai/Sujet-Financial-RAG-FR-Dataset
Source: Hugging Face · Updated: 2024-07-28 · Indexed: 2025-04-12
Download link: https://hf-mirror.com/datasets/sujet-ai/Sujet-Financial-RAG-FR-Dataset
---
dataset_info:
  features:
  - name: question
    dtype: string
  - name: context
    dtype: string
  splits:
  - name: train
    num_bytes: 67025771
    num_examples: 28880
  - name: test
    num_bytes: 2817295
    num_examples: 1209
  download_size: 3107384
  dataset_size: 69843066
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
license: mit
language:
- fr
pretty_name: F
tags:
- finance
- financial embedding
- financial qa
- financial question answer
- financial rag
- embedding model finetuning
---
# Sujet-Financial-RAG-FR-Dataset 📊💼
## Description 📝
This dataset is a proof-of-concept collection of French question-context pairs designed for training and evaluating embedding models in the financial domain. To demonstrate the value of this approach, we hand-selected a small set of publicly available French financial documents. Gathering more financial documents and generating more questions per chunk would be a straightforward way to build much larger and richer datasets.
This dataset was used to fine-tune the embedding models [sujet-ai/Marsilia-Embeddings-FR-Base](https://huggingface.co/sujet-ai/Marsilia-Embeddings-FR-Base) and [sujet-ai/Marsilia-Embeddings-FR-Large](https://huggingface.co/sujet-ai/Marsilia-Embeddings-FR-Large), as a demonstration of why fine-tuning open-source models matters when deploying high-performance RAG (Retrieval-Augmented Generation) applications.
## Dataset Content 📊
- **Total Samples**: 30,089
  - Training Set: 28,880 pairs
  - Test Set: 1,209 pairs
- **Columns**:
  - `question`: A generated financial question
  - `context`: The text chunk the question was generated from, where the answer can be found
## Creation Methodology 🛠️
1. **Data Collection**: Financial reports, press releases, and official documents from various French companies and institutions were carefully selected.
2. **Preprocessing**: PDF documents were converted to text and split into chunks.
3. **Question Generation**: For each valid chunk, 20 financial questions were generated using the GPT-4o-mini model, employing a specialized prompt.
4. **Post-processing**: Questions generated from empty or invalid chunks were removed.
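
Steps 2 to 4 can be sketched as follows. The card does not state the chunking strategy or what counts as an invalid chunk, so the fixed-size character window, the overlap, and the minimum-length rule below are assumptions chosen for illustration, and `split_into_chunks` and `is_valid_chunk` are hypothetical helper names. Only the figure of 20 questions per valid chunk comes from the card.

```python
def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows (sizes are assumptions)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


def is_valid_chunk(chunk: str, min_chars: int = 50) -> bool:
    """Assumed validity rule: reject chunks that are empty or nearly empty."""
    return len(chunk.strip()) >= min_chars


# Stand-in for the text extracted from one PDF document.
document_text = "Le produit net bancaire progresse de 3 %. " * 60

chunks = split_into_chunks(document_text)
valid_chunks = [chunk for chunk in chunks if is_valid_chunk(chunk)]

# The card states that 20 questions were generated per valid chunk.
questions_per_chunk = 20
expected_questions = questions_per_chunk * len(valid_chunks)
print(f"{len(valid_chunks)} valid chunks -> {expected_questions} questions to generate")
```

Filtering before generation, rather than after as the card describes, avoids spending model calls on empty chunks; either order yields the same final pairs under this validity rule.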
### Question Generation Prompt 🤖
The following prompt (in French) was used with GPT-4o-mini to generate questions for each chunk:
```
Les informations contextuelles sont ci-dessous.
---------------------
{context_str}
---------------------
Étant donné les informations contextuelles et non des connaissances antérieures,
générez uniquement des questions financières de haute qualité basées sur la requête ci-dessous.
Vous êtes un professeur spécialiste en finance. Votre tâche est de préparer \
{num_questions_per_chunk} questions pour un prochain \
quiz/examen axé sur des sujets financiers. Les questions doivent être \
variées et couvrir divers aspects de la finance, tels que \
la comptabilité, l'investissement, l'analyse de marché et les régulations financières, \
dans tout le document. Limitez les questions aux \
informations contextuelles fournies.
```
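
The trailing backslashes in the prompt suggest it was stored as a Python string with line continuations, though the card does not show the calling code. As a minimal sketch under that assumption, the template can be filled with `str.format`; `build_prompt` is a hypothetical helper, and the call to the model itself is deliberately left out.

```python
# Prompt text copied from the card; a backslash at the end of a line joins it
# with the next line inside a regular (non-raw) Python string.
QUESTION_PROMPT = """\
Les informations contextuelles sont ci-dessous.
---------------------
{context_str}
---------------------
Étant donné les informations contextuelles et non des connaissances antérieures,
générez uniquement des questions financières de haute qualité basées sur la requête ci-dessous.
Vous êtes un professeur spécialiste en finance. Votre tâche est de préparer \
{num_questions_per_chunk} questions pour un prochain \
quiz/examen axé sur des sujets financiers. Les questions doivent être \
variées et couvrir divers aspects de la finance, tels que \
la comptabilité, l'investissement, l'analyse de marché et les régulations financières, \
dans tout le document. Limitez les questions aux \
informations contextuelles fournies.
"""


def build_prompt(context_str: str, num_questions_per_chunk: int = 20) -> str:
    """Fill the template for one chunk (hypothetical helper, not from the card)."""
    return QUESTION_PROMPT.format(
        context_str=context_str,
        num_questions_per_chunk=num_questions_per_chunk,
    )


# Illustrative chunk text, not taken from the dataset.
example_chunk = "Le résultat net part du groupe s'établit à 1,2 milliard d'euros."
prompt = build_prompt(example_chunk)
print(prompt)
```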
## Intended Use 🎯
This dataset is designed for:
- Fine-tuning embedding models for French financial RAG tasks
- Evaluating embedding model performance in financial contexts
- Serving as a foundation for developing financial question-answering systems
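
The card does not specify an evaluation metric. One common choice for retrieval is recall@k: the share of questions whose source context appears among the k most similar contexts. The sketch below assumes cosine similarity and uses toy 2-D vectors in place of real embedding model output; `recall_at_k` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(question_vecs, context_vecs, gold_indices, k=1):
    """Fraction of questions whose gold context ranks in the top k by cosine similarity."""
    hits = 0
    for question_vec, gold in zip(question_vecs, gold_indices):
        ranked = sorted(
            range(len(context_vecs)),
            key=lambda i: cosine(question_vec, context_vecs[i]),
            reverse=True,
        )
        if gold in ranked[:k]:
            hits += 1
    return hits / len(question_vecs)


# Toy vectors standing in for embeddings of three contexts and three questions.
context_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
question_vecs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
gold_indices = [0, 1, 2]  # index of the context each question was generated from

recall_1 = recall_at_k(question_vecs, context_vecs, gold_indices, k=1)
recall_2 = recall_at_k(question_vecs, context_vecs, gold_indices, k=2)
print(f"recall@1 = {recall_1:.2f}, recall@2 = {recall_2:.2f}")
```

Comparing this score before and after fine-tuning on the training split, measured on the test split, is one way to quantify the benefit the card describes.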
## Loading the Dataset 💻
To load and explore the dataset, you can use the following Python code:
```python
from datasets import load_dataset


def load_and_print_dataset_info(dataset_name):
    """Load a dataset from the Hugging Face Hub and print basic information."""
    dataset = load_dataset(dataset_name)
    print(f"\nDataset: {dataset_name}")
    print(f"Number of train examples: {len(dataset['train'])}")
    print(f"Number of test examples: {len(dataset['test'])}")
    print("Sample from train set:")
    print(dataset['train'][0])
    print("\nSample from test set:")
    print(dataset['test'][0])
    return dataset


# Load and print info for the French dataset
fr = load_and_print_dataset_info("sujet-ai/Sujet-Financial-RAG-FR-Dataset")
```
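
Because several questions were generated per chunk, the same `context` string probably appears in many rows (an inference from the methodology, not something the card states). For retrieval experiments it can help to deduplicate contexts and record which one each question points to. The sketch below uses a small in-memory list in place of the real split so that it runs offline; `build_retrieval_corpus` is a hypothetical helper and the rows are invented for illustration.

```python
def build_retrieval_corpus(rows):
    """Deduplicate contexts and map each question to the index of its context."""
    context_to_id = {}
    corpus = []
    queries = []
    for row in rows:
        context = row["context"]
        if context not in context_to_id:
            context_to_id[context] = len(corpus)
            corpus.append(context)
        queries.append((row["question"], context_to_id[context]))
    return corpus, queries


# Invented rows with the same two columns as the dataset.
rows = [
    {"question": "Quel est le résultat net ?", "context": "Chunk A"},
    {"question": "Quel est le chiffre d'affaires ?", "context": "Chunk A"},
    {"question": "Quel est le ratio CET1 ?", "context": "Chunk B"},
]

corpus, queries = build_retrieval_corpus(rows)
print(f"{len(corpus)} unique contexts for {len(queries)} questions")
```

Iterating over a split loaded with `load_dataset` yields one dictionary per row, so passing `fr["test"]` in place of `rows` should work unchanged.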
## Data Sources 📚
### Training Set
1. [Air France-KLM - 2023 Results](https://www.airfranceklm.com/sites/default/files/2024-02/20240228_-_q4_fy_2023_results_-_afklm_-_press_release_fr_0.pdf)
2. [Allianz Home - Annual Report 2022](https://francescpi.com/scpi-de-rendement/allianz-home/ra/allianz-home-rapport-annuel-2022.pdf)
3. [Airbus - Annual Results 2023](https://www.airbus.com/sites/g/files/jlcbta136/files/2024-02/FR-Press-Release-Airbus-FY2023-Results.pdf)
4. [BPCE Group - Q1 2024 Results](https://newsroom.groupebpce.fr/assets/cp-resultats-groupe-bpce-t1-24-vf-pdf-9a19-7b707.html)
5. [BNP Paribas - Annual Results 2023](https://cdn-group.bnpparibas.com/uploads/file/CP_BNPP_R%C3%A9sultats_Annuels_2023_FR.pdf)
6. [EDF - Activity Report 2023](https://www.edf.fr/sites/groupe/files/2024-03/edf-resultats-annuels-2023-rapport-activite-2024-03-01.pdf)
7. [HCSF - Annual Report 2023](https://www.economie.gouv.fr/files/files/directions_services/hcsf/HCSF_Rapport_annuel_2023.pdf?v=1698223265)
8. [HSBC France - Annual Financial Report 2022](https://www.about.hsbc.fr/-/media/france/fr/investors-relations/hsbc-sfh/230307-rapport-financier-annuel-2022.pdf)
9. [La Poste Group - 2023 Results](https://le-groupe-laposte.cdn.prismic.io/le-groupe-laposte/e6f6d760-3c9d-4324-9c6f-5cf7ff4235a3_Communique+de+presse+des+resultats+2023+du+groupe+La+Poste.pdf)
10. [Ministry of Economy - APE Financial Report 2020-2021](https://www.economie.gouv.fr/files/2021-10/Rapport%20financier-APE-2021.pdf)
11. [Orange Bank - Financial Report 2023](https://www.orangebank.fr/dam/jcr:9a801d81-9f09-4b73-8771-06966315b5be/OB%202023%20-%20Rapport%20financier%20v2024-06-20.pdf)
12. [Renault Group - Consolidated Accounts 2023](https://www.renaultgroup.com/wp-content/uploads/2024/02/2023.12-comptes-consolides-2023-1.pdf)
13. [Société Générale SCF - Annual Financial Report 2021](https://www.societegenerale.com/sites/default/files/documents/2022-03/sg-scf-rapport-financier-annuel-2021.pdf)
14. [Société Générale SFH - Annual Financial Report 2023](https://www.societegenerale.com/sites/default/files/documents/2024-03/societe-generale-sfh-rapport-financier-annuel-2023.pdf)
15. [Vivendi - Financial Report and Consolidated Financial Statements 2022](https://www.vivendi.com/wp-content/uploads/2023/03/20230308_VIV_Rapport-financier-et-Etats-financiers-consolides-de-lexercice-2022.pdf)
### Test Set
1. [Société Générale - Q1 2024 Results](https://www.societegenerale.com/sites/default/files/resultats_publication/fr/2024-05/t1-2024-Communique-presse_FR.pdf)
2. [BNP Paribas - Q1 2024 Results](https://cdn-group.bnpparibas.com/uploads/file/CP_BNPP_R%C3%A9sultats_1T-2024_FR.pdf)
## Ethical Considerations 🤔
Users of this dataset should be aware that:
- The data comes from public documents, but its use must respect the copyright and terms of use of the original sources.
- The content reflects the financial information available at the time of dataset creation and may not represent current financial situations.
- AI-generated questions may contain biases or inaccuracies inherent to the generation process.
## Future Work 🔮
- Expansion of the dataset with more diverse sources
- Regular updates with the latest financial reports
- Creation of specialized subsets for specific financial sectors
- Increasing the number of questions generated per chunk to create a larger, more comprehensive dataset
---
If you use this dataset in your research or applications, please cite it as:
```
@software{Sujet-Financial-RAG-FR-Dataset,
  author = {Sujet AI, Allaa Boutaleb, Hamed Rahimi},
  title = {Sujet-Financial-RAG-FR-Dataset: A synthetically generated french financial QA dataset to finetune embedding models},
  year = {2024},
  url = {https://huggingface.co/datasets/sujet-ai/Sujet-Financial-RAG-FR-Dataset}
}
```
For questions, feedback, or collaborations, please reach out to us on [LinkedIn](https://www.linkedin.com/company/sujet-ai/) or visit our website at [https://sujet.ai](https://sujet.ai).
Provider: sujet-ai



