uaytug/fumea-dataset
Hugging Face · Updated 2026-02-28 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/uaytug/fumea-dataset
Resource description:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
- question-answering
- text-classification
tags:
- finance
- tool-use
- function-calling
- qwen3
- sft
- financial-analysis
- sentiment-analysis
- ner
- sec-filings
- fumea
size_categories:
- 100K<n<1M
pretty_name: "FUMEA Dataset — Financial & Tool-Use SFT Corpus"
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
---
# FUMEA Dataset
**FUMEA-Dataset** is a merged, curated, and deduplicated corpus designed for Supervised Fine-Tuning (SFT) of large language models. It unifies two specialized domains — **tool-use / function-calling** and **financial analysis** — into a single, training-ready resource. All samples are pre-formatted with the **Qwen3 chat template** (`<|im_start|>` / `<|im_end|>`) and require no additional preprocessing.
This dataset is the primary training resource behind the [FUMEA-F model family](https://huggingface.co/uaytug/fumea-f), which combines financial reasoning with robust tool-use capabilities.
## Key Features
- **Training-ready**: Every sample is pre-formatted as a Qwen3 ChatML string (plain text, not token IDs), so it can be loaded and passed straight to an SFT trainer.
- **Two-phase SFT design**: Categories map directly to a two-phase curriculum (tool-use mastery → financial specialization).
- **Anti-forgetting replay buffer**: The `finance` category includes a replay subset drawn from `tool-use`, which guards against catastrophic forgetting of function-calling skills during Phase 2.
- **Deduplicated & cleaned**: All source datasets were merged, deduplicated, and quality-filtered before formatting.
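The card does not say how deduplication was carried out. Purely as an illustration, exact-match deduplication over whitespace-normalized text could look like the sketch below. The normalization rule and the helper name `dedupe_exact` are assumptions, not the author's pipeline.

```python
import hashlib


def dedupe_exact(samples):
    """Drop samples whose normalized text has already been seen.

    Hypothetical helper: the real FUMEA pipeline may use a different rule.
    """
    seen = set()
    unique = []
    for sample in samples:
        # Collapse runs of whitespace and lowercase so trivial variants collide.
        normalized = " ".join(sample["text"].split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


rows = [
    {"text": "What is EBITDA?", "category": "finance"},
    {"text": "what  is   EBITDA?", "category": "finance"},  # trivial variant
    {"text": "Call get_quote for MSFT", "category": "tool-use"},
]
deduped = dedupe_exact(rows)
print(len(deduped))
```

Hashing keeps memory use proportional to the number of unique samples rather than their total length; near-duplicate detection (for example MinHash) would need a different approach.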
## Dataset Summary
| | Train | Validation | Total |
|---|---:|---:|---:|
| **`tool-use`** | 66,684 | 3,510 | 70,194 |
| **`finance`** | 306,957 | 16,156 | 323,113 |
| **Total** | **373,641** | **19,666** | **393,307** |
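As a quick consistency check, the totals and split proportions can be recomputed from the table. The snippet below uses only the row counts printed above and does not download anything.

```python
# Row counts copied from the summary table.
counts = {
    "tool-use": {"train": 66_684, "validation": 3_510},
    "finance": {"train": 306_957, "validation": 16_156},
}

train_total = sum(c["train"] for c in counts.values())
val_total = sum(c["validation"] for c in counts.values())
grand_total = train_total + val_total

# Share of each category's rows held out for validation.
val_share = {
    name: c["validation"] / (c["train"] + c["validation"])
    for name, c in counts.items()
}
# Share of the training split that is finance data.
finance_train_share = counts["finance"]["train"] / train_total

print(train_total, val_total, grand_total)
print({name: round(share, 4) for name, share in val_share.items()})
print(round(finance_train_share, 4))
```

The figures show that each category holds out about 5% of its rows for validation, and that finance accounts for roughly 82% of the training split.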
## Categories
### `tool-use` — Function-Calling & API Interaction
Training data for building robust tool-use and function-calling capabilities. Models trained on this subset learn to select appropriate tools, format structured API calls, and interpret tool responses within multi-turn conversations.
**Capabilities covered**: single-turn and multi-turn function calling, tool selection from candidate lists, parameter extraction, structured JSON output, error handling in tool responses.
### `finance` — Financial Analysis & Reasoning
A comprehensive financial NLP corpus covering multiple sub-tasks. This subset enables models to perform sophisticated financial reasoning while retaining tool-use skills through an integrated replay buffer.
**Sub-tasks included**:
- **Sentiment Analysis** — Classifying financial text (news, tweets, reports) as bullish, bearish, or neutral
- **Question Answering** — Answering questions grounded in financial documents and reports
- **SEC Filing Comprehension** — Extracting and reasoning over structured regulatory filings
- **Named Entity Recognition (NER)** — Identifying financial entities (tickers, companies, instruments, monetary values)
- **General Financial Reasoning** — Multi-step inference over financial scenarios and data
- **Tool-Use Replay Buffer** — A stratified subset of the `tool-use` data (FUMEA-TU), mixed in to guard against catastrophic forgetting
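The card does not state which attribute the replay subset was stratified on, or how large it is. The sketch below shows the general idea of stratified sampling under assumed values: the `kind` field, the 10% fraction, and the helper name `stratified_sample` are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample roughly `fraction` of the rows from every group defined by `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sampled = []
    for name in sorted(groups):
        members = groups[name]
        # Keep at least one row per group so no stratum disappears.
        k = max(1, round(len(members) * fraction))
        sampled.extend(rng.sample(members, k))
    return sampled


# Toy rows; "kind" is a hypothetical field, not part of the dataset schema.
tool_rows = (
    [{"text": f"single-{i}", "kind": "single-turn"} for i in range(80)]
    + [{"text": f"multi-{i}", "kind": "multi-turn"} for i in range(20)]
)
replay = stratified_sample(tool_rows, key=lambda r: r["kind"], fraction=0.1)
print(len(replay))
```

Sampling per group keeps the replay subset's mix of conversation types close to that of the full tool-use data, which plain random sampling only achieves on average.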
## Data Format
Each sample has two string fields: `text`, which holds a complete Qwen3 ChatML conversation, and `category`, which is used for filtering.
```
Fields:
- text (string): Full ChatML-formatted conversation
- category (string): "tool-use" or "finance"
```
**Example structure** (simplified):
```
<|im_start|>system
You are a helpful financial assistant with access to the following tools: ...
<|im_end|>
<|im_start|>user
What is the current P/E ratio for AAPL?
<|im_end|>
<|im_start|>assistant
<tool_call>{"name": "get_stock_metrics", "arguments": {"ticker": "AAPL", "metric": "pe_ratio"}}</tool_call>
<|im_end|>
...
```
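Because each row stores the whole conversation as one string, inspecting individual turns means splitting on the ChatML markers. The following is a minimal sketch that assumes turns follow the simplified layout shown above; real rows may contain extra whitespace or other tags, and the helper names are illustrative.

```python
import json
import re

TURN_RE = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_chatml(text):
    """Split a ChatML string into a list of {"role", "content"} dicts."""
    return [
        {"role": role, "content": content.strip()}
        for role, content in TURN_RE.findall(text)
    ]


def extract_tool_calls(content):
    """Return the decoded JSON payload of every <tool_call> block in one turn."""
    return [json.loads(payload) for payload in TOOL_CALL_RE.findall(content)]


sample = (
    "<|im_start|>user\n"
    "What is the current P/E ratio for AAPL?\n"
    "<|im_end|>\n"
    "<|im_start|>assistant\n"
    '<tool_call>{"name": "get_stock_metrics", "arguments": '
    '{"ticker": "AAPL", "metric": "pe_ratio"}}</tool_call>\n'
    "<|im_end|>\n"
)
turns = parse_chatml(sample)
calls = extract_tool_calls(turns[-1]["content"])
print([turn["role"] for turn in turns])
print(calls[0]["name"], calls[0]["arguments"])
```

A regex split is enough for inspection and filtering. For training, the `text` field can be used as-is, since it is already in the target template.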
## Usage
### Loading the Full Dataset
```python
from datasets import load_dataset
dataset = load_dataset("uaytug/fumea-dataset")
print(dataset)
# DatasetDict({
# train: Dataset({features: ['text', 'category'], num_rows: 373641}),
# validation: Dataset({features: ['text', 'category'], num_rows: 19666})
# })
```
### Filtering by Category
```python
# Phase 1: Tool-use only
tool_use_data = dataset.filter(lambda x: x["category"] == "tool-use")
# Phase 2: Financial analysis (includes replay buffer)
finance_data = dataset.filter(lambda x: x["category"] == "finance")
```
### Two-Phase Training Pipeline
This dataset is designed for a curriculum learning approach:
```python
# Phase 1 — Tool-Use Mastery
phase1_train = dataset["train"].filter(lambda x: x["category"] == "tool-use")
# Train until tool-use accuracy > 80%
# Phase 2 — Financial Specialization
phase2_train = dataset["train"].filter(lambda x: x["category"] == "finance")
# The finance category already contains a tool-use replay buffer,
# so no additional mixing is required.
```
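The hand-off between phases can be expressed as a small control loop. This is a sketch only: `train_one_epoch` and `evaluate` are hypothetical callbacks standing in for whatever trainer and tool-use metric are used, and the 0.80 target simply mirrors the "> 80%" comment above.

```python
def run_curriculum(phases, train_one_epoch, evaluate, max_epochs=10):
    """Train phases in order, leaving a phase once its target score is exceeded.

    `phases` is a list of (name, target) pairs; a target of None means
    "run all epochs". Both callbacks are hypothetical hooks.
    """
    history = []
    for name, target in phases:
        for epoch in range(1, max_epochs + 1):
            train_one_epoch(name)
            score = evaluate(name)
            history.append((name, epoch, score))
            if target is not None and score > target:
                break
    return history


# Stand-in callbacks and made-up scores so the sketch runs without a model.
scores = iter([0.62, 0.77, 0.84, 0.70, 0.74, 0.76, 0.77])
log = run_curriculum(
    phases=[("tool-use", 0.80), ("finance", None)],
    train_one_epoch=lambda name: None,
    evaluate=lambda name: next(scores),
    max_epochs=4,
)
for entry in log:
    print(entry)
```

In this toy run the tool-use phase stops after its third epoch, when the score first exceeds the target, and the finance phase then runs for the full four epochs.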
## Intended Use
- **Primary**: Supervised fine-tuning of Qwen3-based models for financial AI applications
- **Compatible architectures**: Any model supporting the ChatML / Qwen3 chat template
- **Recommended base models**: Qwen3-8B, Qwen3-4B, or similar
- **Training frameworks**: Unsloth, HuggingFace TRL/SFTTrainer, Axolotl
## Limitations & Biases
- **English only** — All data is in English. Financial terminology and regulatory content (e.g., SEC filings) are US-centric.
- **Synthetic & curated sources** — Tool-use data originates from synthetic generation pipelines (xLAM, Hermes). While high-quality, it may not cover all real-world API edge cases.
- **Point-in-time financial knowledge** — Financial facts in the dataset reflect their original collection dates and should not be treated as current market data.
- **No investment advice** — Models trained on this dataset are not intended to provide financial advice. Outputs should always be reviewed by qualified professionals.
## Citation
If you use this dataset in your research or projects, please cite:
```bibtex
@misc{fumea-dataset-2026,
author = {uaytug},
title = {FUMEA Dataset: A Unified Financial Analysis and Tool-Use SFT Corpus},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/datasets/uaytug/fumea-dataset}}
}
```
## Related Resources
| Resource | Link |
|---|---|
| FUMEA-F Model (Dense) v2 | [uaytug/fumea-f-dense-v2](https://huggingface.co/uaytug/fumea-f-dense-v2) |
| FUMEA-F Model (Dense) | [uaytug/fumea-f-dense](https://huggingface.co/uaytug/fumea-f-dense) |
## License
This dataset is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Individual source datasets may carry their own licensing terms — please refer to the original repositories for details.
Provider:
uaytug



