five

Finance-Instruct-500k

收藏
魔搭社区2025-12-05 更新2025-09-06 收录
下载链接:
https://modelscope.cn/datasets/Josephgflowers/Finance-Instruct-500k
下载链接
链接失效反馈
官方服务:
资源简介:
# Finance-Instruct-500k Dataset ## Overview **Finance-Instruct-500k** is a comprehensive and meticulously curated dataset designed to train advanced language models for financial tasks, reasoning, and multi-turn conversations. Combining data from numerous high-quality financial datasets, this corpus provides over **500,000 entries**, offering unparalleled depth and versatility for finance-related instruction tuning and fine-tuning. The dataset includes content tailored for financial reasoning, question answering, entity recognition, sentiment analysis, address parsing, and multilingual natural language processing (NLP). Its diverse and deduplicated entries make it suitable for a wide range of financial AI applications, including domain-specific assistants, conversational agents, and information extraction systems. Most entries include system, user and assistant fields. Recent additions include: - **[BAAI/IndustryInstruction_Finance-Economics](https://huggingface.co/datasets/BAAI/IndustryInstruction_Finance-Economics)**: Broader financial instructions and **Chinese** language coverage. - **[Josephgflowers/Financial-NER-NLP](https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP)**: Advanced **XBRL tagging** and named-entity recognition examples. ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6328952f798f8d122ce62a44/MgkW6-hDXoZPYbpVbH5f4.png) --- ## Key Features - **Extensive Coverage**: Over 500,000 entries spanning financial QA, reasoning, sentiment analysis, topic classification, multilingual NER, and conversational AI. - **Multi-Turn Conversations**: Rich dialogues emphasizing contextual understanding and reasoning. - **Diverse Data Sources**: Includes entries from **Cinder**, **Sujet-Finance-Instruct-177k**, **Phinance Dataset**, **BAAI/IndustryInstruction_Finance-Economics**, **Josephgflowers/Financial-NER-NLP**, and many other high-quality datasets. - **RAG-Formatted Data**: Retrieval-augmented generation (RAG) tasks include external data prepended to the `user` field for enhanced contextual understanding. - **Deduplicated and Preprocessed**: Eliminates overlaps and irregular entries for cleaner and higher-quality data. - **XBRL Tagging**: Includes structured finance entity labeling from **Financial-NER-NLP** for advanced extraction tasks. --- **Future Plans** 1M! Like my work? Want to see more? Custom request? Message me on discord: joseph.flowers.ra Donate here: https://buymeacoffee.com/josephgflowers --- ## Supported Tasks and Use Cases 1. **Financial Question Answering**: - Contextual and direct-answer financial QA. - Multilingual QA and financial terminology explanation. 2. **Reasoning Tasks**: - Symbolic and numeric reasoning. - Portfolio analysis and investment strategy simulation. 3. **Conversational AI**: - Multi-turn dialogues to develop finance-specific assistants and advisors. 4. **Named Entity Recognition (NER)**: - Multilingual financial entity recognition. - XBRL tagging for structured finance data (via **Financial-NER-NLP**). - Address parsing and PII handling. 5. **Sentiment Analysis**: - Text classification as bullish, bearish, neutral, positive, or negative. - Entity-level sentiment analysis. 6. **Topic Classification**: - Categorization of financial texts into topics such as market trends, risk analysis, and economic events. 7. **Lightweight LLM Training**: - Domain-specific fine-tuning for smaller models in resource-constrained environments. 8. **RAG Applications**: - Seamless integration with external data using prepended context in the `user` field. --- ## Dataset Composition The dataset is a deduplicated combination of the following sources filtered for finance-related entries or tasks: 1. **[alvanlii/finance-textbooks](https://huggingface.co/datasets/alvanlii/finance-textbooks)** 2. **[glaiveai/RAG-v1](https://huggingface.co/datasets/glaiveai/RAG-v1)** 3. **[instruction-pretrain/ft-instruction-synthesizer-collection](https://huggingface.co/datasets/instruction-pretrain/ft-instruction-synthesizer-collection)** (NewsQA, ConvFinQA, WikiTableQA) 4. **[gretelai/gretel-pii-masking-en-v1](https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1)** 5. **[CohereForAI/aya_dataset (HotpotQA)](https://huggingface.co/datasets/CohereForAI/aya_dataset)** 6. **[CohereForAI/aya_dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset)** 7. **[nvidia/OpenMathInstruct-1](https://huggingface.co/datasets/Nvidia-OpenMathInstruct)** 8. **[TIGER-Lab/WebInstructSub](https://huggingface.co/datasets/TIGER-Lab/WebInstructSub)** 9. **[glaiveai/glaive-code-assistant-v3](https://huggingface.co/datasets/glaiveai/glaive-code-assistant-v3)** 10. **[Open-Orca/1million-gpt-4](https://huggingface.co/datasets/Open-Orca/1million-gpt-4)** 11. **[Norquinal/claude_evol_instruct_210k](https://huggingface.co/datasets/Norquinal/claude_evol_instruct_210k)** 12. **[migtissera/Synthia-v1.3](https://huggingface.co/datasets/migtissera/Synthia-v1.3)** 13. **[meta-math/MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA)** 14. **[HuggingFaceTB/cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)** 15. **[Josephgflowers/PII-NER](https://huggingface.co/datasets/Josephgflowers/PII-NER)** 16. **[gbharti/finance-alpaca](https://huggingface.co/datasets/gbharti/finance-alpaca)** 17. **[ugursa/Yahoo-Finance-News-Sentences](https://huggingface.co/datasets/ugursa/Yahoo-Finance-News-Sentences)** 18. **[AdaptLLM/finance-tasks_Headline](https://huggingface.co/datasets/AdaptLLM/finance-tasks_Headline)** 19. **[ceadar-ie/FinTalk-19k](https://huggingface.co/datasets/ceadar-ie/FinTalk-19k)** 20. **[zeroshot/twitter-financial-news-topic](https://huggingface.co/datasets/zeroshot/twitter-financial-news-topic)** 21. **[dylanalloy/ehc-contrived-financial](https://huggingface.co/datasets/dylanalloy/ehc-contrived-financial)** 22. **[zeroshot/twitter-financial-news-sentiment](https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment)** 23. **[financial_phrasebank](https://huggingface.co/datasets/financial_phrasebank)** 24. **[AdiOO7/llama-2-finance](https://huggingface.co/datasets/AdiOO7/llama-2-finance)** 25. **[amphora/lmsys-finance](https://huggingface.co/datasets/amphora/lmsys-finance)** 26. **[AdaptLLM/finance-tasks_ConvFinQA](https://huggingface.co/datasets/AdaptLLM/finance-tasks_ConvFinQA)** 27. **[KennNguyenDev/FiQA_Financial_Phrasebank_Combined](https://huggingface.co/datasets/KennNguyenDev/FiQA_Financial_Phrasebank_Combined)** 28. **[AdaptLLM/finance-tasks_FPB](https://huggingface.co/datasets/AdaptLLM/finance-tasks_FPB)** 29. **[yixuantt/FinEntity](https://huggingface.co/datasets/yixuantt/FinEntity)** 30. **[AdaptLLM/finance-tasks_FiQA_SA](https://huggingface.co/datasets/AdaptLLM/finance-tasks_FiQA_SA)** 31. **[nickmuchi/financial-classification](https://huggingface.co/datasets/nickmuchi/financial-classification)** 32. **[lighthouzai/finqabench](https://huggingface.co/datasets/lighthouzai/finqabench)** 33. **[adityamavle/FinRiskAnalysis](https://huggingface.co/datasets/adityamavle/FinRiskAnalysis)** 34. **[sujet-ai/Sujet-Finance-Instruct-177k](https://huggingface.co/datasets/sujet-ai/Sujet-Finance-Instruct-177k)** 35. **Cinder (custom multi-turn reasoning and parsing data)** 36. **[BAAI/IndustryInstruction_Finance-Economics](https://huggingface.co/datasets/BAAI/IndustryInstruction_Finance-Economics)** 37. **[Josephgflowers/Financial-NER-NLP](https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP)** --- ## Dataset Structure ### Data Fields - **system**: Instruction or system message providing context. - **user**: Query or input from the user, with RAG data prepended if applicable. - **assistant**: Model-generated response or ground truth answer. ### Format All entries are structured as: - `system`, `user`, `assistant` ### Example Entry ```json { "system": "You are a financial assistant. Answer the user's question accurately but keep it brief.", "user": "What is the difference between stocks and bonds?", "assistant": "Stocks represent ownership in a company, while bonds are a form of debt where the investor lends money to an entity." } ``` --- ## Languages - **English**: Primary language of the dataset. - **Chinese**: Increased coverage thanks to the **BAAI/IndustryInstruction_Finance-Economics** subset. - **Other Multilingual Data**: Includes content from multilingual NER sources and the Aya datasets. --- ## Collection and Preprocessing 1. **Deduplication**: Removed over 60,000 duplicate entries across all source datasets. 2. **Data Cleaning**: Eliminated non-ASCII characters, fixed punctuation and spelling, and ensured consistent formatting. 3. **Dataset Merging**: Unified entries from various sources into a single cohesive dataset. # Lost in the dedupe. Future releases will restore Annotations. 4. **Annotation**: Enhanced metadata for clarity and usability, including task types and system prompts. 5. **XBRL Tagging**: Integrated from **Financial-NER-NLP** subset, enabling structured labeling of financial instruments. --- ## Ethical Considerations - **User Privacy**: All PII is synthetic and anonymized to ensure compliance with privacy standards. - **Professional Use Only**: This dataset is not a substitute for certified financial guidance or professional advice. --- ## Limitations - **Bias**: Coverage may skew toward certain financial sectors or topics based on dataset distribution. - **Accuracy**: Outputs trained on this dataset require validation for critical financial applications. - **Multilingual Support**: Non-English entries vary in volume, though recent additions (BAAI dataset) increase Chinese content. --- ## Citation If you use this dataset, please cite: ```bibtex @dataset{josephgflowers2025financeinstruct, title={Finance-Instruct-500k}, author={Joseph G. Flowers}, year={2025}, url={https://huggingface.co/datasets/Josephgflowers/Finance-Instruct-500k} } ``` --- ## How to Load the Dataset ```python from datasets import load_dataset dataset = load_dataset("Josephgflowers/Finance-Instruct-500k") print(dataset["train"][0]) ``` --- ## License This dataset is released under the Apache 2.0 license.

# Finance-Instruct-500k 数据集 ## 概述 **Finance-Instruct-500k** 是一款经细致甄选与严格整理的数据集,旨在训练面向金融任务、推理与多轮对话的高级语言模型。本语料库整合了众多优质金融数据集的内容,包含超过**50万条数据条目**,可为金融相关的指令微调与全模型微调提供无与伦比的深度与通用性。 该数据集涵盖适配金融推理、问答、实体识别、情感分析、地址解析以及多语言自然语言处理(Natural Language Processing,NLP)的内容。其多样化且经过去重的条目可适配各类金融人工智能(Artificial Intelligence,AI)应用场景,包括领域专属助手、对话AI智能体(AI Agent)以及信息抽取系统。 多数数据条目包含`system`、`user`与`assistant`三个字段。 近期新增数据源包括: - **[BAAI/IndustryInstruction_Finance-Economics](https://huggingface.co/datasets/BAAI/IndustryInstruction_Finance-Economics)**:涵盖更广泛的金融指令与**中文**语料。 - **[Josephgflowers/Financial-NER-NLP](https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP)**:包含高级**可扩展商业报告语言(XBRL)**标注与命名实体识别示例。 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6328952f798f8d122ce62a44/MgkW6-hDXoZPYbpVbH5f4.png) --- ## 核心特性 - **覆盖范围广泛**:包含超50万条数据,涵盖金融问答(Question Answering,QA)、推理、情感分析、主题分类、多语言命名实体识别(Named Entity Recognition,NER)以及对话式AI应用。 - **支持多轮对话**:包含丰富对话语料,侧重上下文理解与推理能力。 - **数据源多样**:收录来自**Cinder**、**Sujet-Finance-Instruct-177k**、**Phinance Dataset**、**BAAI/IndustryInstruction_Finance-Economics**、**Josephgflowers/Financial-NER-NLP**等众多优质数据集的条目。 - **检索增强生成(Retrieval-Augmented Generation,RAG)格式数据**:RAG任务会将外部数据前置至`user`字段中,以强化上下文理解能力。 - **去重且预处理完成**:移除重复条目与异常数据,确保数据更简洁、质量更高。 - **支持XBRL标注**:包含来自**Financial-NER-NLP**的结构化金融实体标注数据,可用于高级抽取任务。 --- **未来规划**:目标扩容至100万条数据!如果您认可我的工作并希望看到更多相关内容,或有定制化需求,请通过Discord联系我:joseph.flowers.ra。也可通过以下链接进行捐赠:https://buymeacoffee.com/josephgflowers --- ## 支持的任务与应用场景 1. **金融问答**: - 基于上下文的直接答案型金融问答。 - 多语言问答与金融术语解释。 2. **推理任务**: - 符号推理与数值推理。 - 投资组合分析与投资策略模拟。 3. **对话式AI**: - 多轮对话语料,用于开发金融专属助手与顾问。 4. **命名实体识别(NER)**: - 多语言金融实体识别。 - 针对结构化金融数据的XBRL标注(基于**Financial-NER-NLP**数据集)。 - 地址解析与个人可识别信息(Personally Identifiable Information,PII)处理。 5. **情感分析**: - 将文本分类为看涨、看跌、中性、积极或消极。 - 实体级情感分析。 6. **主题分类**: - 将金融文本归类至不同主题,如市场趋势、风险分析与经济事件。 7. **轻量化大语言模型(Large Language Model,LLM)训练**: - 为资源受限环境下的小型模型提供领域专属微调支持。 8. **RAG应用**: - 通过`user`字段中前置的上下文,实现与外部数据的无缝集成。 --- ## 数据集构成 该数据集为经过去重处理的多源数据集整合版本,所有数据源均经过筛选以适配金融相关条目或任务: 1. **[alvanlii/finance-textbooks](https://huggingface.co/datasets/alvanlii/finance-textbooks)** 2. **[glaiveai/RAG-v1](https://huggingface.co/datasets/glaiveai/RAG-v1)** 3. **[instruction-pretrain/ft-instruction-synthesizer-collection](https://huggingface.co/datasets/instruction-pretrain/ft-instruction-synthesizer-collection)**(NewsQA、ConvFinQA、WikiTableQA) 4. **[gretelai/gretel-pii-masking-en-v1](https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1)** 5. **[CohereForAI/aya_dataset (HotpotQA)](https://huggingface.co/datasets/CohereForAI/aya_dataset)** 6. **[CohereForAI/aya_dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset)** 7. **[nvidia/OpenMathInstruct-1](https://huggingface.co/datasets/Nvidia-OpenMathInstruct)** 8. **[TIGER-Lab/WebInstructSub](https://huggingface.co/datasets/TIGER-Lab/WebInstructSub)** 9. **[glaiveai/glaive-code-assistant-v3](https://huggingface.co/datasets/glaiveai/glaive-code-assistant-v3)** 10. **[Open-Orca/1million-gpt-4](https://huggingface.co/datasets/Open-Orca/1million-gpt-4)** 11. **[Norquinal/claude_evol_instruct_210k](https://huggingface.co/datasets/Norquinal/claude_evol_instruct_210k)** 12. **[migtissera/Synthia-v1.3](https://huggingface.co/datasets/migtissera/Synthia-v1.3)** 13. **[meta-math/MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA)** 14. **[HuggingFaceTB/cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)** 15. **[Josephgflowers/PII-NER](https://huggingface.co/datasets/Josephgflowers/PII-NER)** 16. **[gbharti/finance-alpaca](https://huggingface.co/datasets/gbharti/finance-alpaca)** 17. **[ugursa/Yahoo-Finance-News-Sentences](https://huggingface.co/datasets/ugursa/Yahoo-Finance-News-Sentences)** 18. **[AdaptLLM/finance-tasks_Headline](https://huggingface.co/datasets/AdaptLLM/finance-tasks_Headline)** 19. **[ceadar-ie/FinTalk-19k](https://huggingface.co/datasets/ceadar-ie/FinTalk-19k)** 20. **[zeroshot/twitter-financial-news-topic](https://huggingface.co/datasets/zeroshot/twitter-financial-news-topic)** 21. **[dylanalloy/ehc-contrived-financial](https://huggingface.co/datasets/dylanalloy/ehc-contrived-financial)** 22. **[zeroshot/twitter-financial-news-sentiment](https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment)** 23. **[financial_phrasebank](https://huggingface.co/datasets/financial_phrasebank)** 24. **[AdiOO7/llama-2-finance](https://huggingface.co/datasets/AdiOO7/llama-2-finance)** 25. **[amphora/lmsys-finance](https://huggingface.co/datasets/amphora/lmsys-finance)** 26. **[AdaptLLM/finance-tasks_ConvFinQA](https://huggingface.co/datasets/AdaptLLM/finance-tasks_ConvFinQA)** 27. **[KennNguyenDev/FiQA_Financial_Phrasebank_Combined](https://huggingface.co/datasets/KennNguyenDev/FiQA_Financial_Phrasebank_Combined)** 28. **[AdaptLLM/finance-tasks_FPB](https://huggingface.co/datasets/AdaptLLM/finance-tasks_FPB)** 29. **[yixuantt/FinEntity](https://huggingface.co/datasets/yixuantt/FinEntity)** 30. **[AdaptLLM/finance-tasks_FiQA_SA](https://huggingface.co/datasets/AdaptLLM/finance-tasks_FiQA_SA)** 31. **[nickmuchi/financial-classification](https://huggingface.co/datasets/nickmuchi/financial-classification)** 32. **[lighthouzai/finqabench](https://huggingface.co/datasets/lighthouzai/finqabench)** 33. **[adityamavle/FinRiskAnalysis](https://huggingface.co/datasets/adityamavle/FinRiskAnalysis)** 34. **[sujet-ai/Sujet-Finance-Instruct-177k](https://huggingface.co/datasets/sujet-ai/Sujet-Finance-Instruct-177k)** 35. **Cinder(自定义多轮推理与解析数据)** 36. **[BAAI/IndustryInstruction_Finance-Economics](https://huggingface.co/datasets/BAAI/IndustryInstruction_Finance-Economics)** 37. **[Josephgflowers/Financial-NER-NLP](https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP)** --- ## 数据集结构 ### 数据字段 - **system**:提供上下文的指令或系统提示信息。 - **user**:用户的查询或输入,若适用则前置RAG数据。 - **assistant**:模型生成的回复或标准答案。 ### 数据格式 所有数据条目均采用以下结构: - `system`、`user`与`assistant`字段 ### 示例条目 json { "system": "You are a financial assistant. Answer the user's question accurately but keep it brief.", "user": "What is the difference between stocks and bonds?", "assistant": "Stocks represent ownership in a company, while bonds are a form of debt where the investor lends money to an entity." } --- ## 支持语言 - **英语**:数据集的主要语言。 - **中文**:借助**BAAI/IndustryInstruction_Finance-Economics**子集,中文语料覆盖范围有所提升。 - **其他多语言数据**:包含来自多语言NER数据源与Aya数据集的语料。 --- ## 采集与预处理 1. **去重处理**:移除了所有源数据集中超过6万条重复条目。 2. **数据清洗**:剔除非ASCII字符,修正标点与拼写错误,并确保格式统一。 3. **数据集合并**:将多源数据条目整合为一个统一连贯的数据集。 # 去重过程中丢失了部分标注信息,未来版本将恢复标注功能。 4. **标注优化**:补充元数据以提升清晰度与易用性,包括任务类型与系统提示。 5. **XBRL标注集成**:从**Financial-NER-NLP**子集中整合标注数据,支持金融工具的结构化标注。 --- ## 伦理考量 - **用户隐私**:所有个人可识别信息(PII)均为合成且经过匿名化处理,符合隐私保护标准。 - **仅用于专业场景**:本数据集不可替代持证金融咨询或专业财务建议。 --- ## 局限性 - **数据偏差**:根据源数据集的分布情况,覆盖范围可能偏向特定金融行业或主题。 - **准确性限制**:基于本数据集训练的模型输出,在关键金融应用中需经过验证。 - **多语言支持不足**:非英语语料的体量存在差异,尽管近期新增的BAAI数据集提升了中文语料占比。 --- ## 引用格式 若您使用本数据集,请按以下格式引用: bibtex @dataset{josephgflowers2025financeinstruct, title={Finance-Instruct-500k}, author={Joseph G. Flowers}, year={2025}, url={https://huggingface.co/datasets/Josephgflowers/Finance-Instruct-500k} } --- ## 数据集加载方式 python from datasets import load_dataset dataset = load_dataset("Josephgflowers/Finance-Instruct-500k") print(dataset["train"][0]) --- ## 授权协议 本数据集采用Apache 2.0协议发布。
提供机构:
maas
创建时间:
2025-08-31
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作