
R1 Financial Reasoning Chain-of-Thought Dataset 500K

Source: ModelScope community (魔搭社区) · Updated 2026-05-30 · Indexed 2025-02-22
Download link:
https://modelscope.cn/datasets/IngeniusAI/Finance_R1-Distill_Data
Resource description:
# Finance R1 Distill Dataset

A chain-of-thought (CoT) distillation dataset for complex problems in the financial domain. Financial domain knowledge is extracted from a large-scale corpus through CoT distillation.

## Data Sources

- Based on Finance-Instruct-500k, a financial-domain instruction and dialogue dataset
- Chain-of-thought reasoning distilled with the DeepSeek-R1 large language model (LLM)
- The original dataset contains more than 500,000 high-quality financial entries
- Updated continuously; about 2,385 entries have been distilled so far

## Original Data Characteristics

Finance-Instruct-500k integrates multiple high-quality financial datasets, covering:

- Financial QA and reasoning
- Entity recognition and sentiment analysis
- Multi-turn dialogue and instruction examples
- XBRL tagging and named entity recognition
- Multilingual natural language processing tasks

Main data sources:

- BAAI/IndustryInstruction_Finance-Economics
- Josephgflowers/Financial-NER-NLP
- Sujet-Finance-Instruct-177k
- Other high-quality financial-domain datasets

## Data Format

Each record contains the following fields:

```json
{
  "id": "Unique identifier",
  "user_input": "Original financial question",
  "reasoning_content": "The LLM's chain-of-thought reasoning process",
  "answer_r1": "Final answer",
  "created_by": "Ingenius_AI",
  "contact": "Ingenius AI official account"
}
```

## Technical Framework

1. Data preprocessing:
   - Load the raw financial QA data from JSON input files
   - Process the data in batches of 1,000 entries to keep each run manageable
2. Chain-of-thought distillation:
   - Run inference with the DeepSeek-R1 model
   - Generate a detailed reasoning process (`reasoning_content`) for each question
   - Generate the final answer (`answer_r1`)
3. Data storage:
   - Save the processed results in JSONL format
   - Store each batch in a separate file so that large volumes of data stay easy to manage

## Dataset Characteristics

- Focuses on specialized financial-domain questions
- Includes detailed chain-of-thought reasoning
- Generated by LLM distillation, with controllable quality
- Supports financial QA, reasoning, multi-turn dialogue, and other tasks
- Continuously updated and expanded

## Download Method

For the dataset's file metadata and data files, browse the "Dataset Files" page.

You can download the dataset in the following ways:

:modelscope-code[]{type="sdk"}

:modelscope-code[]{type="git"}

## Citation

If you use this dataset, please cite it as follows:

```bibtex
@dataset{ingeniusai2025finance,
  title={Finance_R1-Distill_Data},
  author={IngeniusAI},
  year={2025},
  publisher={ModelScope}
}
```

## Contact Information

Follow the "Ingenius AI" official account (公众号) for the latest project updates.

## License

This dataset is open-sourced under the Apache License 2.0 and is intended for academic research use only.
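The batching and JSONL storage steps described in the README's technical framework can be illustrated with a minimal sketch. It is not the authors' actual pipeline, which the page does not show. The function names, the `batch_00000.jsonl` file naming scheme, and the `distill_fn` callback are assumptions made for illustration. `fake_distill` is a stand-in for a DeepSeek-R1 call, since the real inference code is not part of the source. Only the field names and the batch size of 1,000 come from the README.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterator

BATCH_SIZE = 1000  # batch size stated in the README
REQUIRED_FIELDS = ("id", "user_input", "reasoning_content", "answer_r1")


def batched(items: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def distill_to_jsonl(
    questions: list[dict],
    out_dir: Path,
    distill_fn: Callable[[str], tuple[str, str]],
    batch_size: int = BATCH_SIZE,
) -> list[Path]:
    """Write one JSONL file per batch and return the file paths.

    `distill_fn` is a hypothetical callback that maps a question to
    (reasoning, answer); in the real project this would call DeepSeek-R1.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for batch_no, batch in enumerate(batched(questions, batch_size)):
        path = out_dir / f"batch_{batch_no:05d}.jsonl"  # assumed naming scheme
        with path.open("w", encoding="utf-8") as fh:
            for question in batch:
                reasoning, answer = distill_fn(question["user_input"])
                record = {
                    "id": question["id"],
                    "user_input": question["user_input"],
                    "reasoning_content": reasoning,
                    "answer_r1": answer,
                }
                # ensure_ascii=False keeps non-ASCII text readable in the file
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        paths.append(path)
    return paths


def read_jsonl(path: Path) -> list[dict]:
    """Load one batch file, checking that each record has the core fields."""
    records = []
    with path.open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [f for f in REQUIRED_FIELDS if f not in record]
            if missing:
                raise ValueError(f"{path.name}:{line_no} missing {missing}")
            records.append(record)
    return records


def fake_distill(question: str) -> tuple[str, str]:
    """Placeholder for a model call; returns (reasoning, answer)."""
    return f"reasoning about: {question}", f"answer to: {question}"


if __name__ == "__main__":
    sample = [{"id": str(i), "user_input": f"question {i}"} for i in range(2500)]
    with tempfile.TemporaryDirectory() as tmp:
        files = distill_to_jsonl(sample, Path(tmp), fake_distill)
        print([p.name for p in files], len(read_jsonl(files[-1])))
```

Writing one file per batch means an interrupted run loses at most the batch in progress, which fits a dataset whose distillation is still under way.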
Provider:
maas
Created:
2025-02-13
Collected summary
Dataset introduction
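Background and challenges
Background overview
This is a chain-of-thought reasoning dataset for the financial domain. It is built on the Finance-Instruct-500k financial instruction and dialogue data and generated by chain-of-thought distillation with the DeepSeek-R1 large language model. The source dataset contains more than 500,000 high-quality entries, of which about 2,385 have been distilled so far. The dataset focuses on specialized financial questions and provides a detailed chain-of-thought reasoning process and a final answer for each one. It supports financial QA, reasoning, and multi-turn dialogue tasks, is updated continuously, and is intended for academic research.
The above content was collected and summarized by 遇见数据集.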