haidar-ali/tallyformer-finance-dataset

Name: haidar-ali/tallyformer-finance-dataset
Creator: haidar-ali
Published: 2025-12-16 08:03:24
License: No description provided

Source: Hugging Face. Updated 2025-12-16. Indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/haidar-ali/tallyformer-finance-dataset

Description:

This finance-domain dataset was prepared for training the TallyFormer-Finance-51M model. It has three parts: pretraining data, distillation data, and supervised fine-tuning (SFT) data. The data has been cleaned, filtered, and tokenized, and is stored in Apache Parquet format for efficient training and sampling. It is derived from public datasets, including Falcon-RefinedWeb, PES2O, SlimPajama, and Finance-Alpaca, with preprocessing tailored to financial language understanding tasks. The pretraining data is used for continual pretraining and general language modeling, the distillation data for knowledge distillation from gpt2-medium, and the SFT data for instruction tuning in the finance domain.
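The description says that one part of the dataset supports knowledge distillation from gpt2-medium, but it does not state which objective is used. As a rough illustration only, the sketch below shows the commonly used temperature-softened KL divergence between teacher and student next-token distributions. The function names, the temperature value, and the toy logits are all assumptions made for this example; none of them comes from the dataset or from the TallyFormer training code.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper: a generic distillation objective, not
    necessarily the one used to train TallyFormer-Finance-51M.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Toy logits over a three-token vocabulary (made-up values).
teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
print(distillation_loss(student, teacher))  # positive when the distributions differ
print(distillation_loss(teacher, teacher))  # zero when the student matches the teacher
```

In practice this term would be computed per token over the full vocabulary and is often combined with the ordinary cross-entropy loss on the ground-truth tokens.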

Provided by:

haidar-ali
