uaytug/fumea-dataset

Name: uaytug/fumea-dataset
Creator: uaytug
Published: 2026-02-28 23:05:33
License: Apache-2.0 (stated in the dataset card; the listing field is empty)

Source: Hugging Face. Updated 2026-02-28; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/uaytug/fumea-dataset


Description:

---
language:
- en
license: apache-2.0
task_categories:
- text-generation
- question-answering
- text-classification
tags:
- finance
- tool-use
- function-calling
- qwen3
- sft
- financial-analysis
- sentiment-analysis
- ner
- sec-filings
- fumea
size_categories:
- 100K<n<1M
pretty_name: "FUMEA Dataset — Financial & Tool-Use SFT Corpus"
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
---

# FUMEA Dataset

**FUMEA-Dataset** is a merged, curated, and deduplicated corpus designed for Supervised Fine-Tuning (SFT) of large language models. It unifies two specialized domains, **tool-use / function-calling** and **financial analysis**, into a single, training-ready resource. All samples are pre-formatted with the **Qwen3 chat template** (`<|im_start|>` / `<|im_end|>`) and require no additional preprocessing.

This dataset is the primary training resource behind the [FUMEA-F model family](https://huggingface.co/uaytug/fumea-f), which combines financial reasoning with robust tool-use capabilities.

## Key Features

- **Training-ready**: Every sample is pre-formatted as Qwen3 ChatML text, so it can be loaded and used for training immediately.
- **Two-phase SFT design**: Categories map directly to a two-phase curriculum (tool-use mastery → financial specialization).
- **Anti-forgetting replay buffer**: The `finance` category includes a tool-use replay subset drawn from `tool-use`, preventing catastrophic forgetting of function-calling skills during Phase 2.
- **Deduplicated & cleaned**: All source datasets were merged, deduplicated, and quality-filtered before formatting.

## Dataset Summary

| Category | Train | Validation | Total |
|---|---:|---:|---:|
| **`tool-use`** | 66,684 | 3,510 | 70,194 |
| **`finance`** | 306,957 | 16,156 | 323,113 |
| **Total** | **373,641** | **19,666** | **393,307** |

## Categories

### `tool-use`: Function-Calling & API Interaction

Training data for building robust tool-use and function-calling capabilities. Models trained on this subset learn to select appropriate tools, format structured API calls, and interpret tool responses within multi-turn conversations.

**Capabilities covered**: single-turn and multi-turn function calling, tool selection from candidate lists, parameter extraction, structured JSON output, error handling in tool responses.

### `finance`: Financial Analysis & Reasoning

A comprehensive financial NLP corpus covering multiple sub-tasks. This subset enables models to perform sophisticated financial reasoning while retaining tool-use skills through an integrated replay buffer.

**Sub-tasks included**:

- **Sentiment Analysis**: Classifying financial text (news, tweets, reports) as bullish, bearish, or neutral
- **Question Answering**: Answering questions grounded in financial documents and reports
- **SEC Filing Comprehension**: Extracting and reasoning over structured regulatory filings
- **Named Entity Recognition (NER)**: Identifying financial entities (tickers, companies, instruments, monetary values)
- **General Financial Reasoning**: Multi-step inference over financial scenarios and data
- **Tool-Use Replay Buffer**: A stratified subset from FUMEA-TU mixed in to prevent catastrophic forgetting

## Data Format

Each sample has a single `text` field containing a complete Qwen3 ChatML conversation, plus a `category` field for filtering.

```
Fields:
  - text (string): Full ChatML-formatted conversation
  - category (string): "tool-use" or "finance"
```

**Example structure** (simplified):

```
<|im_start|>system
You are a helpful financial assistant with access to the following tools:
...
<|im_end|>
<|im_start|>user
What is the current P/E ratio for AAPL?
<|im_end|>
<|im_start|>assistant
<tool_call>{"name": "get_stock_metrics", "arguments": {"ticker": "AAPL", "metric": "pe_ratio"}}</tool_call>
<|im_end|>
...
```

## Usage

### Loading the Full Dataset

```python
from datasets import load_dataset

dataset = load_dataset("uaytug/fumea-dataset")
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['text', 'category'], num_rows: 373641}),
#     validation: Dataset({features: ['text', 'category'], num_rows: 19666})
# })
```

### Filtering by Category

```python
# `dataset` is the DatasetDict loaded above; the filter is applied to every split.

# Phase 1: Tool-use only
tool_use_data = dataset.filter(lambda x: x["category"] == "tool-use")

# Phase 2: Financial analysis (includes replay buffer)
finance_data = dataset.filter(lambda x: x["category"] == "finance")
```

### Two-Phase Training Pipeline

This dataset is designed for a curriculum learning approach:

```python
# Phase 1: Tool-Use Mastery
phase1_train = dataset["train"].filter(lambda x: x["category"] == "tool-use")
# Train until tool-use accuracy > 80%

# Phase 2: Financial Specialization
phase2_train = dataset["train"].filter(lambda x: x["category"] == "finance")
# The finance category already contains a tool-use replay buffer,
# so no additional mixing is required.
```

## Intended Use

- **Primary**: Supervised fine-tuning of Qwen3-based models for financial AI applications
- **Compatible architectures**: Any model supporting the ChatML / Qwen3 chat template
- **Recommended base models**: Qwen3-8B, Qwen3-4B, or similar
- **Training frameworks**: Unsloth, HuggingFace TRL/SFTTrainer, Axolotl

## Limitations & Biases

- **English only**: All data is in English. Financial terminology and regulatory content (e.g., SEC filings) are US-centric.
- **Synthetic & curated sources**: Tool-use data originates from synthetic generation pipelines (xLAM, Hermes). While high-quality, it may not cover all real-world API edge cases.
- **Point-in-time financial knowledge**: Financial facts in the dataset reflect their original collection dates and should not be treated as current market data.
- **No investment advice**: Models trained on this dataset are not intended to provide financial advice. Outputs should always be reviewed by qualified professionals.

## Citation

If you use this dataset in your research or projects, please cite:

```bibtex
@misc{fumea-dataset-2026,
  author       = {uaytug},
  title        = {FUMEA Dataset: A Unified Financial Analysis and Tool-Use SFT Corpus},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/uaytug/fumea-dataset}}
}
```

## Related Resources

| Resource | Link |
|---|---|
| FUMEA-F Model (Dense) v2 | [uaytug/fumea-f-dense-v2](https://huggingface.co/uaytug/fumea-f-dense-v2) |
| FUMEA-F Model (Dense) | [uaytug/fumea-f-dense](https://huggingface.co/uaytug/fumea-f-dense) |

## License

This dataset is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Individual source datasets may carry their own licensing terms; please refer to the original repositories for details.
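
The Data Format section describes each sample as one ChatML string, which is convenient for training but awkward for inspection. The sketch below shows one way to split such a string back into turns and pull out any tool calls. It assumes the layout of the simplified example in the card (a role name followed by a newline, and a JSON payload wrapped in `<tool_call>` tags); real samples may deviate from this. The helper names `parse_chatml` and `extract_tool_calls` are hypothetical and not part of the dataset or any library.

```python
import json
import re

# One ChatML turn: <|im_start|>ROLE\nCONTENT<|im_end|>
TURN_RE = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)
# One tool call: <tool_call>JSON</tool_call> (format assumed from the card's example)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_chatml(text):
    """Split a ChatML string into a list of {"role", "content"} dicts."""
    return [
        {"role": role, "content": content.strip()}
        for role, content in TURN_RE.findall(text)
    ]


def extract_tool_calls(content):
    """Return the decoded JSON payload of every <tool_call> block in one turn."""
    calls = []
    for payload in TOOL_CALL_RE.findall(content):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            # Skip payloads that are not valid JSON instead of failing.
            continue
    return calls


# A shortened version of the example conversation from the card.
sample = (
    "<|im_start|>user\n"
    "What is the current P/E ratio for AAPL?\n"
    "<|im_end|>\n"
    "<|im_start|>assistant\n"
    '<tool_call>{"name": "get_stock_metrics", "arguments": '
    '{"ticker": "AAPL", "metric": "pe_ratio"}}</tool_call>\n'
    "<|im_end|>"
)

messages = parse_chatml(sample)
calls = extract_tool_calls(messages[-1]["content"])
print([m["role"] for m in messages])
print(calls[0]["name"])
```

Applied to a loaded split, the same helpers could be mapped over the `text` field to spot-check role ordering or count tool calls per category before training.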

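
As a quick consistency check on the Dataset Summary table, the row counts can be re-added and turned into shares. The snippet below is only a sketch: it uses the figures printed in the table and does not download the dataset, so it confirms the table's internal arithmetic, not the actual contents of the repository.

```python
# Row counts copied from the Dataset Summary table.
counts = {
    "tool-use": {"train": 66_684, "validation": 3_510},
    "finance": {"train": 306_957, "validation": 16_156},
}

train_total = sum(c["train"] for c in counts.values())
val_total = sum(c["validation"] for c in counts.values())
grand_total = train_total + val_total

for name, c in counts.items():
    rows = c["train"] + c["validation"]
    print(
        f"{name}: {rows} rows, {rows / grand_total:.1%} of corpus, "
        f"{c['validation'] / rows:.1%} held out for validation"
    )

print(f"total: {grand_total} rows ({train_total} train, {val_total} validation)")
```

The totals match the table, and both categories hold out roughly 5% of their rows for validation. The `finance` category makes up about 82% of the corpus, which is worth keeping in mind when budgeting steps for the two training phases.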

Provider:

uaytug
