Yizhao Financial Dataset
ModelScope community · Updated 2026-01-07 · Indexed 2024-12-07
Download link:
https://modelscope.cn/datasets/HITsz-TMG/YiZhao
Resource description:
# ***Harbin Institute of Technology & China Merchants Bank Yizhao Financial Dataset Description***
[Arxiv][[YiZhao-12B-Chat](https://modelscope.cn/models/CMB_AILab/YiZhao-12B-Chat)][[Cleaning Tool](https://github.com/HITsz-TMG/FinPile)]
---
**Yizhao Dataset** is a 2TB high-quality multimodal dataset for training large models. Its goal is a large-scale financial-domain dataset that is more finance-focused, cleaner, and aligned with core socialist values.
The dataset covers not only a wide range of financial events and market trends but also various financial products and trading modes, so that models trained on it generalize well and predict accurately in complex financial environments.
The dataset places great weight on data compliance: it safeguards data privacy, protects trade secrets, and requires data to align with core socialist values. Professional data cleaning methods are used so that the data can be used safely and legally without infringing on user rights and interests, in support of industry knowledge exploration and stronger intelligent decision-making.
## ***Open Source Overview***
We processed the original dataset with the cleaning tool, financial data classifier, and security risk identification classifier that are open-sourced alongside it, producing cleaner, finance-focused Chinese and English datasets that align with core socialist values.
The dataset open-sourced this time includes a **936GB** Chinese text dataset, a **100GB** English text dataset, and a **1TB** high-quality multimodal dataset.
The text dataset covers all aspects of the financial domain. The data is processed into jsonl format with a unified set of fields, and each file is no larger than 7GB. We retain the financial score assigned by the financial data classifier and the risk score assigned by the security risk identification classifier, so that subsets of different quality and scale can later be selected by score. We also spot-checked the dataset manually: 4,000 samples were drawn at random, and annotators labeled each one on whether it contains content that violates core socialist values, contains discriminatory content, involves illegal or non-compliant commercial activity, infringes on others' legitimate rights and interests, or involves pornography, violence, or gambling.
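
Selecting a subset by score, as described above, can be sketched in a few lines of Python. This is a minimal illustration, not the project's own tooling: the thresholds `min_fin_score` and `max_risk_score` are hypothetical, the scores are assumed to sit under the `meta` object as in the field description, and it is an assumption that a higher `risk_score` means riskier content. The sample records are invented for the example.

```python
import json
from typing import Iterable, Iterator


def filter_records(lines: Iterable[str],
                   min_fin_score: int = 4,
                   max_risk_score: float = 0.5) -> Iterator[dict]:
    """Yield records whose scores pass both (hypothetical) thresholds."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the jsonl stream
        record = json.loads(line)
        meta = record.get("meta", {})
        fin = meta.get("fin_int_score")
        risk = meta.get("risk_score")
        if fin is None or risk is None:
            continue  # drop records that lack either score
        # Assumption: higher risk_score means riskier content.
        if fin >= min_fin_score and risk <= max_risk_score:
            yield record


# Invented sample lines standing in for one jsonl file.
sample_lines = [
    json.dumps({"meta": {"id": "a", "fin_int_score": 5, "risk_score": 0.1}, "text": "t1"}),
    json.dumps({"meta": {"id": "b", "fin_int_score": 2, "risk_score": 0.1}, "text": "t2"}),
    json.dumps({"meta": {"id": "c", "fin_int_score": 5, "risk_score": 0.9}, "text": "t3"}),
    json.dumps({"meta": {"id": "d"}, "text": "t4"}),
]

kept_ids = [r["meta"]["id"] for r in filter_records(sample_lines)]
print(kept_ids)  # ['a']
```

For a real file, passing an open file handle (for example `open(path, encoding="utf-8")`) as `lines` keeps memory use flat, which matters when files approach 7GB.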
## ***Significance of Multimodal Large Models in the Financial Domain***
Multimodal large models play a critical role in the financial domain. The financial sector involves massive amounts of data and complex mathematical models, so multimodal large models are needed to process and analyze data of various types, such as text, images, and audio. Their core advantage is that they can handle multiple data types and draw on information from multiple modalities during training and inference, which improves accuracy and robustness. In finance, multimodal large models can be applied to many practical problems, such as credit scoring, risk management, financial fraud detection, and portfolio optimization. In credit scoring, for example, a multimodal large model can consider text and image information together to better assess a borrower's credit risk. In financial fraud detection, it can analyze several types of data, such as transaction records, network traffic, and images, to better detect fraudulent behavior.

To expand the diversity of the data, we took the high-quality text selected by the financial data classifier described above and, through engineering techniques and algorithmic strategies, generated a **1TB** high-quality multimodal dataset suitable for model training.
## ***Data Field Description***
* "meta": 【object】Text information
  * "id": 【string】Unique ID of the text
  * "url": 【string】URL of the original page of the text
  * "title": 【string】Title
  * "source_domain": 【string】Domain name of the source website
  * "dump": 【string】CommonCrawl snapshot to which the text belongs
  * "fin_int_score": 【int】Financial score rounded to an integer, in the range [1,5]
  * "fin_score_model": 【string】Model version used for the financial score
  * "risk_score": 【float】Score predicted by the security risk identification classifier
  * "risk_score_model": 【string】Model version used for the security risk identification score
  * "language": 【string】`en` for English, `zh` for Chinese
  * "images": 【list】Storage paths of the images corresponding to the text
* "text": 【string】Text content
* "qa": 【list】QA pairs based on the text
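
The field layout above can be mirrored in typed Python containers, which makes the nesting of `meta` explicit. This is a sketch under assumptions: the class names `Meta` and `Record` and the helper `parse_record` are hypothetical, the inner structure of each `qa` item is not documented here and is therefore kept as raw values, and every value in the sample record is invented for illustration.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Any, List, Optional


@dataclass
class Meta:
    """Mirrors the documented "meta" object; only "id" is treated as required."""
    id: str
    url: Optional[str] = None
    title: Optional[str] = None
    source_domain: Optional[str] = None
    dump: Optional[str] = None
    fin_int_score: Optional[int] = None
    fin_score_model: Optional[str] = None
    risk_score: Optional[float] = None
    risk_score_model: Optional[str] = None
    language: Optional[str] = None
    images: List[str] = field(default_factory=list)


@dataclass
class Record:
    meta: Meta
    text: str
    # The shape of each QA item is an unknown here, so items stay untyped.
    qa: List[Any] = field(default_factory=list)


def parse_record(line: str) -> Record:
    """Parse one jsonl line, ignoring any meta keys not listed above."""
    raw = json.loads(line)
    known = {f.name for f in fields(Meta)}
    meta_raw = {k: v for k, v in raw.get("meta", {}).items() if k in known}
    return Record(meta=Meta(**meta_raw),
                  text=raw.get("text", ""),
                  qa=raw.get("qa") or [])


# Invented sample line; none of these values come from the dataset.
sample_line = json.dumps({
    "meta": {
        "id": "example-0001",
        "url": "https://example.com/article",
        "title": "Example title",
        "source_domain": "example.com",
        "dump": "CC-MAIN-2023-50",
        "fin_int_score": 4,
        "fin_score_model": "hypothetical-v1",
        "risk_score": 0.02,
        "risk_score_model": "hypothetical-v1",
        "language": "en",
        "images": [],
    },
    "text": "Example text.",
    "qa": [],
})

record = parse_record(sample_line)
print(record.meta.id, record.meta.fin_int_score, record.meta.language)
```

Filtering unknown keys before constructing `Meta` is a deliberate choice: it keeps the parser from failing if a file carries fields beyond those documented above.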
## ***Data Open License***
We publish this dataset under the Open Data Commons Attribution License (ODC-By) v1.0. When using this dataset, you must also comply with any license agreements and terms of use of the original data sources.
Provider:
maas
Created:
2024-12-10
Collected Summary
Dataset Introduction

Background and Challenges
Background Overview
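The Yizhao financial dataset is a 2TB high-quality multimodal financial-domain dataset built jointly by Harbin Institute of Technology and China Merchants Bank, aimed at training large models that generalize and predict better in financial settings. It contains 936GB of Chinese text, 100GB of English text, and 1TB of multimodal data. It emphasizes data compliance and core socialist values, relies on financial classification and security risk scoring to ensure data quality, and is suited to financial applications such as credit scoring and risk management.
The content above was collected and summarized by 遇见数据集.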



