Par-Four-Fineweb-Edu-Fortified-Finance
Source: ModelScope community. Updated 2025-12-05; indexed 2025-09-06.
Download link:
https://modelscope.cn/datasets/Josephgflowers/Par-Four-Fineweb-Edu-Fortified-Finance
Resource description:
This dataset is a subset of https://huggingface.co/datasets/Josephgflowers/Par-Four-Fineweb-Edu-Fortified
Keyword filtering and scoring were used to extract the entries with a finance focus.
Creation script:
https://huggingface.co/datasets/Josephgflowers/Par-Four-Fineweb-Edu-Fortified-Finance/resolve/main/find-fin-fine.py

## Dataset Description
### Summary
This dataset is a finance-focused filtered subset of the Fineweb-Edu-Fortified dataset. It extracts and prioritizes high-quality educational content relevant to finance, economics, and related topics. Entries are scored against regex-based finance keywords, so the dataset emphasizes finance-specific text while reducing the overall size for more targeted training.
The dataset includes three key fields:
- **score**: Represents the quality and relevance of the text content.
- **text**: The main content of the webpage, filtered for finance-related keywords.
- **url**: The source URL from which the text was extracted.
### Source and Reference
The original dataset, Fineweb-Edu-Fortified, was created from 95 Common Crawl datasets, covering web content from 2013 to 2024. This filtered version maintains its core educational purpose while refining the focus to finance-related materials.
The original dataset card is available [here](https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fortified).
## Supported Tasks and Use Cases
1. **Financial Model Fine-Tuning**
   - Train smaller language models for finance-specific applications such as market analysis, economic research, and financial QA.
2. **Synthetic Dataset Creation**
   - Generate question-answer pairs and financial problem-solving datasets using extracted content.
3. **Topic-Based Training**
   - Focus on specialized finance-related training by grouping the dataset by subject.
4. **Model Healing**
   - Fine-tune pruned models to restore or enhance knowledge in financial and economic domains.
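
For the Topic-Based Training use case, grouping by subject could look roughly like the sketch below. The subject names and regex patterns in `SUBJECT_PATTERNS` are hypothetical and chosen only for illustration; the dataset itself does not ship subject labels, so any grouping scheme is up to the user.

```python
import re
from collections import defaultdict

# Hypothetical subjects and patterns (assumptions, not part of the dataset).
SUBJECT_PATTERNS = {
    "banking": re.compile(r"\b(?:bank|loan|mortgage|deposit)s?\b", re.IGNORECASE),
    "investing": re.compile(r"\b(?:stock|bond|portfolio|dividend)s?\b", re.IGNORECASE),
    "macroeconomics": re.compile(r"\b(?:inflation|gdp|unemployment)\b", re.IGNORECASE),
}


def group_by_subject(records):
    """Group records by every subject whose pattern appears in the text.

    A record can land in several groups; records matching no pattern
    are collected under "other".
    """
    groups = defaultdict(list)
    for record in records:
        matched = [
            name
            for name, pattern in SUBJECT_PATTERNS.items()
            if pattern.search(record["text"])
        ]
        for name in matched or ["other"]:
            groups[name].append(record)
    return dict(groups)
```

Allowing a record to appear in more than one group keeps mixed-topic pages available to each relevant training run; deduplicate afterwards if the groups are later merged.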
## Dataset Structure
### Data Fields
- **score** (int): Quality score representing relevance to finance and economics.
- **text** (string): The main content of the webpage.
- **url** (string): The source URL.
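
As a minimal sketch of how these three fields might be handled downstream, the snippet below checks the shape of each row and keeps those at or above a minimum score. The names `FinanceRecord`, `is_valid_record`, and `select_by_score` are hypothetical helpers, and the rows are assumed to arrive as plain Python dictionaries.

```python
from typing import Any, Iterable, Iterator, TypedDict


class FinanceRecord(TypedDict):
    """Row shape described in the card: score (int), text (str), url (str)."""
    score: int
    text: str
    url: str


def is_valid_record(row: Any) -> bool:
    """Return True if the row has the three documented fields with the documented types."""
    return (
        isinstance(row, dict)
        and isinstance(row.get("score"), int)
        and not isinstance(row.get("score"), bool)  # bool is a subclass of int
        and isinstance(row.get("text"), str)
        and isinstance(row.get("url"), str)
    )


def select_by_score(rows: Iterable[Any], min_score: int) -> Iterator[FinanceRecord]:
    """Yield well-formed rows whose score is at least min_score."""
    for row in rows:
        if is_valid_record(row) and row["score"] >= min_score:
            yield row
```

A suitable `min_score` depends on the score distribution in the data, which the card does not document.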
### Format
The dataset is pre-filtered and scored using regex-based financial keywords. It retains only entries meeting a financial relevance threshold.
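The general idea of regex-based keyword scoring with a threshold can be sketched as follows. This is not the author's `find-fin-fine.py`: the keyword list, the weights, and the threshold value are all assumptions made for illustration, and the real script may score text differently.

```python
import re

# Hypothetical keyword weights (assumptions; the real list is in find-fin-fine.py).
FINANCE_KEYWORDS = {
    r"interest rates?": 2,
    r"inflation": 2,
    r"stocks?": 1,
    r"bonds?": 1,
    r"portfolio": 1,
}

# Compile each keyword once, with word boundaries and case-insensitive matching.
COMPILED = [
    (re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE), weight)
    for pattern, weight in FINANCE_KEYWORDS.items()
]


def finance_score(text: str) -> int:
    """Sum weight * match count over all keyword patterns."""
    return sum(weight * len(pattern.findall(text)) for pattern, weight in COMPILED)


def filter_records(records, threshold=3):
    """Yield records whose finance score meets the (assumed) threshold.

    The score is stored under "score" to mirror the card's field name;
    whether the original script does the same is an assumption.
    """
    for record in records:
        score = finance_score(record["text"])
        if score >= threshold:
            yield {**record, "score": score}
```

Weighting multi-word terms above single common words is one way to reduce false positives from words such as "bond" or "stock" that also appear in non-financial text.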
## Languages
The dataset is predominantly in **English**, but some entries may include multilingual content due to the source data.
## License
This dataset is released under the [Open Data Commons Attribution License (ODC-By) v1.0](https://opendatacommons.org/licenses/by/1.0/), in accordance with the original dataset’s licensing.
### Help Here
Like my work? Want to see more? Have a custom request? Message me on Discord: joseph.flowers.ra. Donate here: https://buymeacoffee.com/josephgflowers
## Citation
If you use this dataset, please cite the original Fineweb-Edu-Fortified dataset:
```bibtex
@dataset{airtrain2024finewebedu,
title={Fineweb-Edu-Fortified},
author={Airtrain AI},
year={2024},
url={https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fortified}
}
```
Provider: maas
Created: 2025-08-31



