
Par-Four-Fineweb-Edu-Fortified-Finance

ModelScope community. Updated: 2025-12-05. Indexed: 2025-09-06.
Download link:
https://modelscope.cn/datasets/Josephgflowers/Par-Four-Fineweb-Edu-Fortified-Finance
Description:
Subset of https://huggingface.co/datasets/Josephgflowers/Par-Four-Fineweb-Edu-Fortified

Used keyword filtering and scoring to extract data with a finance focus.

Creation script: https://huggingface.co/datasets/Josephgflowers/Par-Four-Fineweb-Edu-Fortified-Finance/resolve/main/find-fin-fine.py

![image/webp](https://cdn-uploads.huggingface.co/production/uploads/6328952f798f8d122ce62a44/dJOGYzc9hdGl5C0HcEgqa.webp)

## Dataset Description

### Summary

This dataset is a finance-focused filtered subset of the Fineweb-Edu-Fortified dataset. It is designed to extract and prioritize high-quality educational content relevant to finance, economics, and related topics. Using advanced keyword-based scoring, the dataset emphasizes finance-specific entries while reducing the overall dataset size for more targeted training.

The dataset includes three key fields:

- **score**: Represents the quality and relevance of the text content.
- **text**: The main content of the webpage, filtered for finance-related keywords.
- **url**: The source URL from which the text was extracted.

### Source and Reference

The original dataset, Fineweb-Edu-Fortified, was created from 95 Common Crawl datasets, covering web content from 2013 to 2024. This filtered version maintains its core educational purpose while refining the focus to finance-related materials.

The original dataset card is available [here](https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fortified).

## Supported Tasks and Use Cases

1. **Financial Model Fine-Tuning**
   - Train smaller language models for finance-specific applications such as market analysis, economic research, and financial QA.
2. **Synthetic Dataset Creation**
   - Generate question-answer pairs and financial problem-solving datasets using extracted content.
3. **Topic-Based Training**
   - Focus on specialized finance-related training by grouping the dataset by subject.
4. **Model Healing**
   - Fine-tune pruned models to restore or enhance knowledge in financial and economic domains.

## Dataset Structure

### Data Fields

- **score** (int): Quality score representing relevance to finance and economics.
- **text** (string): The main content of the webpage.
- **url** (string): The source URL.

### Format

The dataset is pre-filtered and scored using regex-based financial keywords. It retains only entries meeting a financial relevance threshold.

## Languages

The dataset is predominantly in **English**, but some entries may include multilingual content due to the source data.

## License

This dataset is released under the [Open Data Commons Attribution License (ODC-By) v1.0](https://opendatacommons.org/licenses/by/1.0/), in accordance with the original dataset’s licensing.

### Help Here

Like my work? Want to see more? Custom request? Message me on Discord: joseph.flowers.ra

Donate here: https://buymeacoffee.com/josephgflowers

## Citation

If you use this dataset, please cite the original Fineweb-Edu-Fortified dataset:

```bibtex
@dataset{airtrain2024finewebedu,
  title={Fineweb-Edu-Fortified},
  author={Airtrain AI},
  year={2024},
  url={https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fortified}
}
```
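
As a rough illustration of the regex-based keyword scoring and threshold filtering mentioned under Format, the sketch below scores each record and keeps those at or above a cutoff. It is not the author's `find-fin-fine.py`: the keyword list, the weights, the `threshold` value, and the helper names `finance_score` and `filter_records` are all assumptions made up for this example. Only the `score`, `text`, and `url` field names come from the dataset card.

```python
import re

# Hypothetical keywords and weights, chosen only for illustration.
# The real list is defined in find-fin-fine.py and is not reproduced here.
FINANCE_KEYWORDS = {
    "interest rate": 2,
    "inflation": 2,
    "dividend": 2,
    "stock": 1,
    "bond": 1,
    "portfolio": 1,
}

# One case-insensitive pattern per keyword. Word boundaries keep "bond"
# from matching "bonding"; the optional "s" also catches simple plurals.
PATTERNS = {
    kw: re.compile(r"\b" + re.escape(kw) + r"s?\b", re.IGNORECASE)
    for kw in FINANCE_KEYWORDS
}


def finance_score(text: str) -> int:
    """Sum the keyword weights over every match found in the text."""
    return sum(
        FINANCE_KEYWORDS[kw] * len(pattern.findall(text))
        for kw, pattern in PATTERNS.items()
    )


def filter_records(records, threshold=3):
    """Keep records scoring at or above the (assumed) threshold."""
    kept = []
    for rec in records:
        score = finance_score(rec["text"])
        if score >= threshold:
            kept.append({"score": score, "text": rec["text"], "url": rec["url"]})
    return kept


if __name__ == "__main__":
    sample = [
        {
            "text": "Rising inflation pushed the central bank to raise the "
                    "interest rate, hurting bond prices.",
            "url": "https://example.com/a",
        },
        {
            "text": "The recipe calls for two eggs and a cup of flour.",
            "url": "https://example.com/b",
        },
    ]
    for row in filter_records(sample):
        print(row["score"], row["url"])
```

Weighted match counts are only one plausible way to turn keyword hits into an integer score; the actual script may normalize by text length or combine the keyword score with the source dataset's quality score.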

Provider:
maas
Created:
2025-08-31