Financial data | Text dataset

库帕思 | Updated 2025-12-22 | Indexed 2025-12-27

Download link:

https://www.kupasai.com/corpus/detail?id=679&type=1

Description:

BeanCounter is a large-scale business text dataset built by the University of Chicago. It contains more than 159 billion tokens drawn from corporate disclosures filed in the U.S. Securities and Exchange Commission (SEC) EDGAR system, such as annual reports and credit agreements. The data has been cleaned and deduplicated, is low in toxicity and highly domain-specific, covers finance, strategy, risk, and other business topics, and includes timestamps and metadata. Given its scale and quality, the dataset is suited to training large language models (LLMs) for the business domain, supporting tasks such as financial report analysis, financial forecasting, and risk assessment. It can also be used for model bias evaluation and business information retrieval.

Provider:

库帕思

Created:

2025-12-18