sujet-ai/Sujet-Financial-RAG-EN-Dataset
Source: Hugging Face. Updated: 2024-07-28. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/sujet-ai/Sujet-Financial-RAG-EN-Dataset
Resource description:
---
dataset_info:
  features:
  - name: question
    dtype: string
  - name: context
    dtype: string
  splits:
  - name: train
    num_bytes: 242642511
    num_examples: 98590
  - name: test
    num_bytes: 23907031
    num_examples: 7068
  download_size: 11253933
  dataset_size: 266549542
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
license: mit
language:
- en
tags:
- finance
- financial embedding
- financial qa
- financial question answer
- financial rag
- embedding model finetuning
size_categories:
- 10K<n<100K
---
# Sujet Financial RAG EN Dataset 📊💼
## Description 📝
The Sujet Financial RAG EN Dataset is a collection of English question-context pairs designed for training and evaluating embedding models in the financial domain. To demonstrate the value of domain-specific fine-tuning, we hand-selected a variety of publicly available English financial documents, with a focus on 10-K Forms.
A 10-K Form is a comprehensive report filed annually by public companies about their financial performance. Required by the U.S. Securities and Exchange Commission (SEC), the report provides a detailed picture of a company's business, financial condition, and results of operations.
This dataset was utilized to fine-tune the embedding models [sujet-ai/Marsilia-Embeddings-EN-Base](https://huggingface.co/sujet-ai/Marsilia-Embeddings-EN-Base) and [sujet-ai/Marsilia-Embeddings-EN-Large](https://huggingface.co/sujet-ai/Marsilia-Embeddings-EN-Large), demonstrating the critical importance of fine-tuning open-source models for deploying high-performance RAG (Retrieval-Augmented Generation) applications.
Gathering more financial documents and generating additional questions per chunk is fairly straightforward and would yield much bigger and richer datasets.
## Dataset Content 📊
- **Total Samples**: 105,658
  - Training Set: 98,590 pairs
  - Test Set: 7,068 pairs
- **Columns**:
  - `question`: A generated financial question
  - `context`: The corresponding context where the answer can be found
## Creation Methodology 🛠️
1. **Data Collection**: Financial reports, primarily 10-K Forms, and other official documents from various companies and financial institutions were carefully selected.
2. **Preprocessing**: PDF documents were converted to text and split into chunks.
3. **Question Generation**: For each valid chunk, 20 financial questions were generated using the GPT-4o-mini model, employing a specialized prompt.
4. **Post-processing**: Questions generated from empty or invalid chunks were removed.
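The steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the character-window chunking, the `chunk_size` and `min_chars` values, and the helper names `split_into_chunks`, `is_valid_chunk`, `build_pairs`, and `stub_generator` are all assumptions, and the GPT-4o-mini call is replaced by a stub so the example runs offline. The sketch also filters invalid chunks before generation rather than afterwards, which produces the same pairs.

```python
from typing import Callable, Dict, List


def split_into_chunks(text: str, chunk_size: int = 1000) -> List[str]:
    """Split text into consecutive character windows (assumed strategy)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def is_valid_chunk(chunk: str, min_chars: int = 20) -> bool:
    """Treat empty or very short chunks as invalid (assumed rule)."""
    return len(chunk.strip()) >= min_chars


def build_pairs(
    text: str,
    generate_questions: Callable[[str, int], List[str]],
    num_questions_per_chunk: int = 20,
    chunk_size: int = 1000,
) -> List[Dict[str, str]]:
    """Return one question-context pair per generated question."""
    pairs = []
    for chunk in split_into_chunks(text, chunk_size):
        if not is_valid_chunk(chunk):
            continue  # skip empty or invalid chunks
        for question in generate_questions(chunk, num_questions_per_chunk):
            pairs.append({"question": question, "context": chunk})
    return pairs


def stub_generator(chunk: str, n: int) -> List[str]:
    """Hypothetical stand-in for the LLM call; returns placeholder questions."""
    return [f"Placeholder question {i + 1}" for i in range(n)]


# Invented sample text followed by whitespace-only padding that yields invalid chunks.
sample_text = "Revenue increased 9% year over year. " * 40 + " " * 1000
pairs = build_pairs(sample_text, stub_generator, num_questions_per_chunk=2, chunk_size=500)
print(len(pairs))
```

Passing the generator as a callable keeps the chunking and filtering logic testable without network access; a real run would swap `stub_generator` for a function that sends the prompt to the model and parses its reply.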
### Question Generation Prompt 🤖
The following prompt was used with GPT-4o-mini to generate questions for each chunk:
```
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge,
generate only high-quality financial questions based on the below query.
You are a Professor specialized in finance. Your task is to setup \
{num_questions_per_chunk} questions for an upcoming \
quiz/examination focused on financial topics. The questions should be \
diverse in nature and cover various aspects of finance, such as \
accounting, investment, market analysis, and financial regulations, \
across the document. Restrict the questions to the \
context information provided.
```
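As a hedged illustration of how the two placeholders might be filled, the snippet below stores the prompt as a Python string and substitutes `{context_str}` and `{num_questions_per_chunk}` with `str.format`. The trailing backslashes in the prompt are assumed to be line continuations inside a triple-quoted string, so those lines are joined here. The helper name `build_prompt` and the sample chunk are invented for the example.

```python
PROMPT_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge,\n"
    "generate only high-quality financial questions based on the below query.\n"
    "You are a Professor specialized in finance. Your task is to setup "
    "{num_questions_per_chunk} questions for an upcoming "
    "quiz/examination focused on financial topics. The questions should be "
    "diverse in nature and cover various aspects of finance, such as "
    "accounting, investment, market analysis, and financial regulations, "
    "across the document. Restrict the questions to the "
    "context information provided."
)


def build_prompt(context_str: str, num_questions_per_chunk: int = 20) -> str:
    """Fill the template for one chunk (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(
        context_str=context_str,
        num_questions_per_chunk=num_questions_per_chunk,
    )


# The chunk text below is made up for illustration only.
prompt = build_prompt("Net revenue for fiscal 2023 was $10.2 billion.", 20)
print(prompt)
```

Only the template is parsed for placeholders, so braces that happen to appear inside a chunk are inserted verbatim and do not break the formatting.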
## Intended Use 🎯
This dataset is designed for:
- Fine-tuning embedding models for English financial RAG tasks
- Evaluating embedding model performance in financial contexts
- Serving as a foundation for developing financial question-answering systems
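
One common way to evaluate an embedding model on question-context pairs is to rank all contexts by cosine similarity to each question and check where the paired context lands. The sketch below computes recall@k and mean reciprocal rank (MRR) from precomputed vectors. The metric choice, the function names, and the toy vectors are assumptions for illustration, since this card does not specify an evaluation protocol.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieval_metrics(
    question_vecs: List[Sequence[float]],
    context_vecs: List[Sequence[float]],
    gold: List[int],
    k: int = 1,
) -> Tuple[float, float]:
    """Return (recall@k, MRR), where gold[i] is the index of question i's context."""
    hits = 0
    reciprocal_rank_sum = 0.0
    for q_vec, gold_idx in zip(question_vecs, gold):
        scores = [cosine(q_vec, c_vec) for c_vec in context_vecs]
        # Context indices sorted from most to least similar.
        ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        rank = ranking.index(gold_idx) + 1
        hits += int(rank <= k)
        reciprocal_rank_sum += 1.0 / rank
    n = len(gold)
    return hits / n, reciprocal_rank_sum / n


# Toy 2-D vectors standing in for real embeddings.
contexts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
questions = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
gold = [0, 1, 2]
recall_at_1, mrr = retrieval_metrics(questions, contexts, gold, k=1)
print(recall_at_1, mrr)
```

In the toy data the third question ranks its paired context second, so it counts as a miss at k=1 but still contributes 0.5 to the reciprocal-rank sum.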
## Loading the Dataset 💻
To load and explore the dataset, you can use the following Python code:
```python
from datasets import load_dataset


def load_and_print_dataset_info(dataset_name):
    """Load a dataset from the Hugging Face Hub and print its split sizes and one sample per split."""
    dataset = load_dataset(dataset_name)
    print(f"\nDataset: {dataset_name}")
    print(f"Number of train examples: {len(dataset['train'])}")
    print(f"Number of test examples: {len(dataset['test'])}")
    print("Sample from train set:")
    print(dataset['train'][0])
    print("\nSample from test set:")
    print(dataset['test'][0])
    return dataset


# Load and print info for English dataset
en = load_and_print_dataset_info("sujet-ai/Sujet-Financial-RAG-EN-Dataset")
```
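Because several questions are generated from each chunk, many rows share an identical `context`. For retrieval experiments it can help to deduplicate contexts into a corpus and map each question to the index of its context. The following is a minimal sketch: the helper name `build_retrieval_corpus` and the toy rows are assumptions, and the function accepts any iterable of dicts with the two columns, such as the rows of a loaded split.

```python
from typing import Dict, Iterable, List, Tuple


def build_retrieval_corpus(
    rows: Iterable[Dict[str, str]],
) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Deduplicate contexts and pair each question with its context index."""
    corpus: List[str] = []
    index_of: Dict[str, int] = {}
    queries: List[Tuple[str, int]] = []
    for row in rows:
        context = row["context"]
        if context not in index_of:
            # First time this context is seen: append it to the corpus.
            index_of[context] = len(corpus)
            corpus.append(context)
        queries.append((row["question"], index_of[context]))
    return corpus, queries


# Toy rows invented for illustration; real rows come from the dataset splits.
toy_rows = [
    {"question": "What was total revenue?", "context": "Revenue was $5 million."},
    {"question": "How much revenue was reported?", "context": "Revenue was $5 million."},
    {"question": "What was net income?", "context": "Net income was $1 million."},
]
corpus, queries = build_retrieval_corpus(toy_rows)
print(len(corpus), len(queries))
```

The resulting `queries` list gives, for each question, the position of its gold context in `corpus`, which is the form most ranking metrics expect.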
## Data Sources 📚
### Training Set
1. [Alphabet Inc. - 10-K Form 2023](https://abc.xyz/assets/43/44/675b83d7455885c4615d848d52a4/goog-10-k-2023.pdf)
2. [Apple Inc. - 10-K Form 2023](https://d18rn0p25nwr6d.cloudfront.net/CIK-0000320193/faab4555-c69b-438a-aaf7-e09305f87ca3.pdf)
3. [Bank of America - 10-K Form 2023](https://investor.bankofamerica.com/regulatory-and-other-filings/annual-reports/content/0001140361-24-014731/0001140361-24-014731.pdf)
4. [BlackRock - 10-K Form 2023](https://d18rn0p25nwr6d.cloudfront.net/CIK-0001364742/c2c250f4-22de-4bea-9e87-ad8816ebe178.pdf)
5. [Credit Suisse - Annual Report 2023](https://www.credit-suisse.com/media/assets/corporate/docs/about-us/investor-relations/financial-disclosures/financial-reports/csag-ar-2023-en.pdf)
6. [Edward Jones - 10-K Form 2023](https://www.sec.gov/Archives/edgar/data/815917/000095017024029758/ck0000815917-jones-10k-2023.pdf)
7. [Goldman Sachs - 10-K Form 2023](https://www.goldmansachs.com/investor-relations/financials/10k/2023/2023-10-k.pdf)
8. [Microsoft - 10-K Form 2023](https://c.s-microsoft.com/en-us/CMSFiles/MSFT_FY23Q4_10K.docx?version=d86a284d-dfce-35ee-366c-d754d90f9174)
9. [PayPal - Form 8-K May 22, 2024](https://s201.q4cdn.com/231198771/files/doc_events/2024/May/22/paypal-2024-annual-meeting-voting-results.pdf)
10. [UBS - 1Q24 Financial Report](https://www.ubs.com/content/dam/assets/cc/investor-relations/quarterlies/2024/1q24/1q24-media-release-en.pdf)
11. [Vanguard - 2023 Financial Annual Report](https://fund-docs.vanguard.com/etf-annual-report.pdf)
12. [Uber - Form 10-K 2024](https://d18rn0p25nwr6d.cloudfront.net/CIK-0001543151/6fabd79a-baa9-4b08-84fe-deab4ef8415f.pdf)
### Test Set
1. [Lyft - 10-K Form 2024](https://d18rn0p25nwr6d.cloudfront.net/CIK-0001759509/d576a7f4-780c-4f39-86a6-aa54b03fa2ec.pdf)
2. [Verizon - 10-K Form 2024](https://quotes.quotemedia.com/data/downloadFiling?webmasterId=104600&ref=318048243&type=PDF&formType=10-K&formDescription=Annual+report+pursuant+to+Section+13+or+15%28d%29&dateFiled=2024-02-09&cik=0000732712)
## Ethical Considerations 🤔
Users of this dataset should be aware that:
- The data comes from public documents, but its use must respect the copyright and terms of use of the original sources.
- The content reflects the financial information available at the time of dataset creation and may not represent current financial situations.
- AI-generated questions may contain biases or inaccuracies inherent to the generation process.
## Future Work 🔮
- Expansion of the dataset with more diverse sources
- Regular updates with the latest financial reports
- Creation of specialized subsets for specific financial sectors
- Increasing the number of questions generated per chunk to create an even larger, more comprehensive dataset
---
## Citation 📄
If you use this dataset in your research or applications, please cite it as follows:
```
@software{Sujet-Financial-RAG-EN-Dataset,
  author = {Sujet AI, Allaa Boutaleb, Hamed Rahimi},
  title = {Sujet-Financial-RAG-EN-Dataset: A synthetically generated English financial QA dataset to finetune embedding models},
  year = {2024},
  url = {https://huggingface.co/datasets/sujet-ai/Sujet-Financial-RAG-EN-Dataset}
}
```
## Contact Information 📮
For questions, feedback, or collaborations, please reach out to us on [LinkedIn](https://www.linkedin.com/company/sujet-ai/) or visit our website [https://sujet.ai](https://sujet.ai).
Provider: sujet-ai