five

Phinance

收藏
魔搭社区2025-11-27 更新2025-10-11 收录
下载链接:
https://modelscope.cn/datasets/Josephgflowers/Phinance
下载链接
链接失效反馈
官方服务:
资源简介:
--- # Dataset Card for Finance Domain Expert Dataset ## Dataset Description ### Summary This dataset is a finance-oriented corpus designed for training Phi 3+ series on tasks like financial QA, reasoning, and multi-turn conversational agents. It combines curated finance-specific and synthetic data, filtered from high-quality sources. Entries are preformatted in **PHI format**, supporting multi-turn conversations with variations such as system-user-assistant or system-data-user-assistant. ### Supported Tasks and Use Cases - **Financial QA**: Domain-specific question answering (e.g., market analysis, terminology). - **Conversational Agents**: Training multi-turn finance chatbots. - **Text Analysis**: Tasks like entity recognition, summarization, sentiment analysis. - **Reasoning**: Numeric and symbolic reasoning in finance. ### Languages - **English**: Main language. - **Multilingual**: Aya datasets. ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6328952f798f8d122ce62a44/r95W-ot0C8INwJ5m2pheg.png) ## Dataset Structure ### Data Fields - **system**: Context-setting message. - **user**: Query or instruction. - **assistant**: Model response. - **data**: External content in specific entries (RAG-style). ### Format Each entry is preformatted in PHI 3 style: - `system`, `user`, `assistant` - or `system`, `data`, `user`, `assistant`. Conversations support multi-turn dialogues, often with 5+ rounds. ## Collection Process 1. **Filtering**: Most sources were filtered for finance content. 2. **Restructuring**: QA pairs reformatted into preformatted PHI-style multi-turn conversations. 3. **Cleaning**: Checked for quality, low-quality data removed, fixed punctuation and spelling errors. 4. **Multilingual Handling**: Aya includes multilingual and bilingual data. ## Usage - **Fine-Tuning**: Train LLMs on finance tasks and dialogues. - **Multi-Turn Training**: Build context-aware chatbots. - **Reasoning**: QA with numerical and table-based tasks. ### sources: - name: alvanlii/finance-textbooks description: "Comprehensive finance-focused dataset used without further filtering." link: "https://huggingface.co/datasets/alvanlii/finance-textbooks" - name: glaiveai/RAG-v1 (reformatted) description: "A subset emphasizing finance-specific content for retrieval tasks." link: "https://huggingface.co/datasets/glaiveai/RAG-v1" apache-2.0 - name: Synthesizer NewsQA, ConvFinQA, WikiTableQA description: "Cleaned, filtered, and reformatted." - name: gretelai/gretel-pii-masking-en-v1 description: "Synthetic dataset reformatted and processed for PII-focused LLM data extraction in finance contexts." link: "https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1" apache-2.0 @dataset{gretel-pii-docs-en-v1, author = {Gretel AI}, title = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents}, year = {2024}, month = {10}, publisher = {Gretel}, } - name: CohereForAI/aya_dataset (HotpotQA) description: "Multilingual subset derived from translated HotpotQA with finance-related QA." link: "https://huggingface.co/datasets/CohereForAI/aya_dataset" apache-2.0 @misc{singh2024aya, title={Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning}, author={Shivalika Singh and Freddie Vargus and Daniel Dsouza and Börje F. Karlsson and Abinaya Mahendiran and Wei-Yin Ko and Herumb Shandilya and Jay Patel and Deividas Mataciunas and Laura OMahony and Mike Zhang and Ramith Hettiarachchi and Joseph Wilson and Marina Machado and Luisa Souza Moura and Dominik Krzemiński and Hakimeh Fadaei and Irem Ergün and Ifeoma Okoh and Aisha Alaagib and Oshan Mudannayake and Zaid Alyafeai and Vu Minh Chien and Sebastian Ruder and Surya Guthikonda and Emad A. Alghamdi and Sebastian Gehrmann and Niklas Muennighoff and Max Bartolo and Julia Kreutzer and Ahmet Üstün and Marzieh Fadaee and Sara Hooker}, year={2024}, eprint={2402.06619}, archivePrefix={arXiv}, primaryClass={cs.CL} } - name: CohereForAI/aya_dataset description: "Additional multilingual QA data with finance-focused filtering." link: "https://huggingface.co/datasets/CohereForAI/aya_dataset" apache-2.0 @misc{singh2024aya, title={Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning}, author={Shivalika Singh and Freddie Vargus and Daniel Dsouza and Börje F. Karlsson and Abinaya Mahendiran and Wei-Yin Ko and Herumb Shandilya and Jay Patel and Deividas Mataciunas and Laura OMahony and Mike Zhang and Ramith Hettiarachchi and Joseph Wilson and Marina Machado and Luisa Souza Moura and Dominik Krzemiński and Hakimeh Fadaei and Irem Ergün and Ifeoma Okoh and Aisha Alaagib and Oshan Mudannayake and Zaid Alyafeai and Vu Minh Chien and Sebastian Ruder and Surya Guthikonda and Emad A. Alghamdi and Sebastian Gehrmann and Niklas Muennighoff and Max Bartolo and Julia Kreutzer and Ahmet Üstün and Marzieh Fadaee and Sara Hooker}, year={2024}, eprint={2402.06619}, archivePrefix={arXiv}, primaryClass={cs.CL} } - name: nvidia/OpenMathInstruct-1 description: "Filtered for mathematical reasoning and finance-adjacent tasks." link: "https://huggingface.co/datasets/Nvidia-OpenMathInstruct" nvidia-licence @article{toshniwal2024openmath, title = {OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset}, author = {Shubham Toshniwal and Ivan Moshkov and Sean Narenthiran and Daria Gitman and Fei Jia and Igor Gitman}, year = {2024}, journal = {arXiv preprint arXiv: Arxiv-2402.10176} } - name: TIGER-Lab/WebInstructSub description: "Web-instruction dataset filtered for finance relevance." link: "https://huggingface.co/datasets/TIGER-Lab/WebInstructSub" apache-2.0 @article{yue2024mammoth2, title={MAmmoTH2: Scaling Instructions from the Web}, author={Yue, Xiang and Zheng, Tuney and Zhang, Ge and Chen, Wenhu}, journal={Advances in Neural Information Processing Systems}, year={2024} } - name: glaiveai/glaive-code-assistant-v3 description: "Code-focused dialogues emphasizing financial contexts." link: "https://huggingface.co/datasets/glaiveai/glaive-code-assistant-v3" licence: apache-2.0 - name: glaiveai/RAG-v1 description: "Second segment emphasizing finance-specific retrieval and RAG-style tasks." link: "https://huggingface.co/datasets/glaiveai/RAG-v1" licence: apache-2.0 - name: Open-Orca/1million-gpt-4 description: "Finance-related instructions and responses extracted from the larger corpus." link: "https://huggingface.co/datasets/Open-Orca/1million-gpt-4" - name: Norquinal/claude_evol_instruct_210k description: "Finance-specific instructions and dialogues extracted from this corpus." link: "https://huggingface.co/datasets/Norquinal/claude_evol_instruct_210k" - name: migtissera/Synthia-v1.3synthia13 description: "Refined for finance-related QA and reasoning tasks." link: "https://huggingface.co/datasets/migtissera/Synthia-v1.3" - name: meta-math/MetaMathQA description: "A subset of MetaMath selected for extended mathematical reasoning with some finance overlap." licence: mit @article{yu2023metamath, title={MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models}, author={Yu, Longhui and Jiang, Weisen and Shi, Han and Yu, Jincheng and Liu, Zhengying and Zhang, Yu and Kwok, James T and Li, Zhenguo and Weller, Adrian and Liu, Weiyang}, journal={arXiv preprint arXiv:2309.12284}, year={2023} } - name: HuggingFaceTB/cosmopedia description: "Filtered and reformatted for finance-adjacent reasoning and data exploration tasks." link: "https://huggingface.co/datasets/HuggingFaceTB/cosmopedia" licence: apache-2.0 @software{benallal2024cosmopedia, author = {Ben Allal, Loubna and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro}, title = {Cosmopedia}, month = February, year = 2024, url = {https://huggingface.co/datasets/HuggingFaceTB/cosmopedia} } - name: Josephgflowers/PII-NER link:"https://huggingface.co/datasets/Josephgflowers/PII-NER" ## Ethical Considerations - **User Privacy**: PII is synthetic. - **Professional Advice**: Outputs are not certified financial guidance. ## Limitations - **Accuracy**: Outputs may require expert validation. - **Bias**: Coverage may vary across finance sub-domains. - **Multilingual**: Non-English content is limited to Aya subsets. ## How to Load the Dataset ```python from datasets import load_dataset dataset = load_dataset("Josephgflowers/Phinance") print(dataset["train"][0])

# 金融领域专家数据集卡片 ## 数据集描述 ### 摘要 本数据集为面向金融领域的语料库,专为针对Phi 3+系列大语言模型(Large Language Model, LLM)的金融问答、推理及多轮对话AI智能体(AI Agent)等任务训练而设计。该数据集整合了精选的金融专属数据与合成数据,均从高质量数据源中筛选得到。所有数据条目均采用**PHI格式**预格式化,支持多轮对话,可采用系统-用户-助手或系统-数据-用户-助手的对话范式。 ### 支持任务与应用场景 - **金融问答**:面向金融领域的专属问答(如市场分析、专业术语解答)。 - **对话智能体**:训练多轮金融聊天机器人。 - **文本分析**:实体识别、文本摘要、情感分析等任务。 - **推理任务**:金融场景下的数值与符号推理。 ### 支持语言 - **英语**:主要使用语言。 - **多语言**:包含Aya数据集。 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6328952f798f8d122ce62a44/r95W-ot0C8INwJ5m2pheg.png) ## 数据集结构 ### 数据字段 - **system**:用于设定对话上下文的消息。 - **user**:用户查询或指令。 - **assistant**:模型生成的回复。 - **data**:部分条目中包含的外部内容(检索增强生成(Retrieval-Augmented Generation, RAG)风格)。 ### 格式规范 每个数据条目均采用PHI 3风格预格式化: - 采用`system`、`user`、`assistant`的范式 - 或`system`、`data`、`user`、`assistant`的范式。 对话支持多轮交互,通常包含5轮及以上对话内容。 ## 采集流程 1. **筛选环节**:绝大多数数据源均针对金融领域内容进行过滤。 2. **重构环节**:将问答对重构为预格式化的PHI风格多轮对话。 3. **清洗环节**:对数据质量进行校验,移除低质量数据,修正标点与拼写错误。 4. **多语言处理**:Aya数据集包含多语言与双语数据。 ## 使用方式 - **微调训练**:针对金融任务与对话场景微调大语言模型。 - **多轮训练**:构建具备上下文感知能力的聊天机器人。 - **推理任务**:支持包含数值与表格类任务的问答。 ### 数据源: - 数据源名称:alvanlii/finance-textbooks 数据源描述:全面的金融专属数据集,未进行额外筛选直接使用。 数据源链接:https://huggingface.co/datasets/alvanlii/finance-textbooks - 数据源名称:glaiveai/RAG-v1(已重构) 数据源描述:针对检索任务筛选的金融专属内容子集。 数据源链接:https://huggingface.co/datasets/glaiveai/RAG-v1 许可证:Apache-2.0 - 数据源名称:Synthesizer NewsQA, ConvFinQA, WikiTableQA 数据源描述:已完成清洗、筛选与格式重构。 - 数据源名称:gretelai/gretel-pii-masking-en-v1 数据源描述:针对金融场景下的个人可识别信息(Personally Identifiable Information, PII)提取任务重构与处理的合成数据集。 数据源链接:https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1 许可证:Apache-2.0 bibtex @dataset{gretel-pii-docs-en-v1, author = {Gretel AI}, title = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents}, year = {2024}, month = {10}, publisher = {Gretel}, } - 数据源名称:CohereForAI/aya_dataset (HotpotQA) 数据源描述:源自翻译后的HotpotQA的多语言子集,包含金融相关问答。 数据源链接:https://huggingface.co/datasets/CohereForAI/aya_dataset 许可证:Apache-2.0 bibtex @misc{singh2024aya, title={Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning}, author={Shivalika Singh and Freddie Vargus and Daniel Dsouza and Börje F. Karlsson and Abinaya Mahendiran and Wei-Yin Ko and Herumb Shandilya and Jay Patel and Deividas Mataciunas and Laura OMahony and Mike Zhang and Ramith Hettiarachchi and Joseph Wilson and Marina Machado and Luisa Souza Moura and Dominik Krzemiński and Hakimeh Fadaei and Irem Ergün and Ifeoma Okoh and Aisha Alaagib and Oshan Mudannayake and Zaid Alyafeai and Vu Minh Chien and Sebastian Ruder and Surya Guthikonda and Emad A. Alghamdi and Sebastian Gehrmann and Niklas Muennighoff and Max Bartolo and Julia Kreutzer and Ahmet Üstün and Marzieh Fadaee and Sara Hooker}, year={2024}, eprint={2402.06619}, archivePrefix={arXiv}, primaryClass={cs.CL} } - 数据源名称:CohereForAI/aya_dataset 数据源描述:经金融领域筛选的额外多语言问答数据。 数据源链接:https://huggingface.co/datasets/CohereForAI/aya_dataset 许可证:Apache-2.0 bibtex @misc{singh2024aya, title={Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning}, author={Shivalika Singh and Freddie Vargus and Daniel Dsouza and Börje F. Karlsson and Abinaya Mahendiran and Wei-Yin Ko and Herumb Shandilya and Jay Patel and Deividas Mataciunas and Laura OMahony and Mike Zhang and Ramith Hettiarachchi and Joseph Wilson and Marina Machado and Luisa Souza Moura and Dominik Krzemiński and Hakimeh Fadaei and Irem Ergün and Ifeoma Okoh and Aisha Alaagib and Oshan Mudannayake and Zaid Alyafeai and Vu Minh Chien and Sebastian Ruder and Surya Guthikonda and Emad A. Alghamdi and Sebastian Gehrmann and Niklas Muennighoff and Max Bartolo and Julia Kreutzer and Ahmet Üstün and Marzieh Fadaee and Sara Hooker}, year={2024}, eprint={2402.06619}, archivePrefix={arXiv}, primaryClass={cs.CL} } - 数据源名称:nvidia/OpenMathInstruct-1 数据源描述:针对数学推理与金融相关任务筛选的数据集。 数据源链接:https://huggingface.co/datasets/Nvidia-OpenMathInstruct 许可证:nvidia-licence bibtex @article{toshniwal2024openmath, title = {OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset}, author = {Shubham Toshniwal and Ivan Moshkov and Sean Narenthiran and Daria Gitman and Fei Jia and Igor Gitman}, year = {2024}, journal = {arXiv preprint arXiv: Arxiv-2402.10176} } - 数据源名称:TIGER-Lab/WebInstructSub 数据源描述:针对金融相关性筛选的网络指令数据集。 数据源链接:https://huggingface.co/datasets/TIGER-Lab/WebInstructSub 许可证:Apache-2.0 bibtex @article{yue2024mammoth2, title={MAmmoTH2: Scaling Instructions from the Web}, author={Yue, Xiang and Zheng, Tuney and Zhang, Ge and Chen, Wenhu}, journal={Advances in Neural Information Processing Systems}, year={2024} } - 数据源名称:glaiveai/glaive-code-assistant-v3 数据源描述:聚焦金融场景的代码导向对话数据集。 数据源链接:https://huggingface.co/datasets/glaiveai/glaive-code-assistant-v3 许可证:Apache-2.0 - 数据源名称:glaiveai/RAG-v1 数据源描述:第二部分数据,聚焦金融专属检索与检索增强生成风格任务。 数据源链接:https://huggingface.co/datasets/glaiveai/RAG-v1 许可证:Apache-2.0 - 数据源名称:Open-Orca/1million-gpt-4 数据源描述:从大型语料库中提取的金融相关指令与回复数据。 数据源链接:https://huggingface.co/datasets/Open-Orca/1million-gpt-4 - 数据源名称:Norquinal/claude_evol_instruct_210k 数据源描述:从该语料库中提取的金融专属指令与对话数据。 数据源链接:https://huggingface.co/datasets/Norquinal/claude_evol_instruct_210k - 数据源名称:migtissera/Synthia-v1.3synthia13 数据源描述:针对金融相关问答与推理任务优化的数据集。 数据源链接:https://huggingface.co/datasets/migtissera/Synthia-v1.3 - 数据源名称:meta-math/MetaMathQA 数据源描述:为扩展数学推理任务筛选的MetaMath子集,包含部分金融相关内容。 许可证:MIT bibtex @article{yu2023metamath, title={MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models}, author={Yu, Longhui and Jiang, Weisen and Shi, Han and Yu, Jincheng and Liu, Zhengying and Zhang, Yu and Kwok, James T and Li, Zhenguo and Weller, Adrian and Liu, Weiyang}, journal={arXiv preprint arXiv:2309.12284}, year={2023} } - 数据源名称:HuggingFaceTB/cosmopedia 数据源描述:针对金融相关推理与数据探索任务筛选并重构的数据集。 数据源链接:https://huggingface.co/datasets/HuggingFaceTB/cosmopedia 许可证:Apache-2.0 bibtex @software{benallal2024cosmopedia, author = {Ben Allal, Loubna and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro}, title = {Cosmopedia}, month = February, year = 2024, url = {https://huggingface.co/datasets/HuggingFaceTB/cosmopedia} } - 数据源名称:Josephgflowers/PII-NER 数据源链接:https://huggingface.co/datasets/Josephgflowers/PII-NER ## 伦理考量 - **用户隐私**:数据中的个人可识别信息均为合成生成。 - **专业建议声明**:数据集输出不构成经认证的金融指导意见。 ## 局限性 - **准确性**:模型输出需经领域专家验证后方可采信。 - **偏差问题**:覆盖范围因金融细分领域不同而存在差异。 - **多语言限制**:非英语内容仅局限于Aya数据集子集。 ## 数据集加载方式 python from datasets import load_dataset dataset = load_dataset("Josephgflowers/Phinance") print(dataset["train"][0])
提供机构:
maas
创建时间:
2025-08-31
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作