BeanCounter

Name: BeanCounter
Creator: University of Chicago
Published: 2024-09-26 21:26:46
License: No description available

arXiv: updated 2024-09-26, indexed 2024-09-28

Download link:

https://huggingface.hub

Overview:

BeanCounter is a large-scale, low-toxicity dataset of business text created by the University of Chicago. It contains more than 159 billion tokens, drawn primarily from companies' public disclosure filings. The data is collected through the EDGAR system, then cleaned and deduplicated to support high quality and factual accuracy. Its main application is finance: it aims to provide low-toxicity, high-quality data for training better financial-domain language models and for reducing the risk that models generate harmful content.

Provider:

University of Chicago

Created:

2024-09-26

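Collected summary

Dataset introduction

Construction

BeanCounter is built from public corporate disclosure filings extracted from the EDGAR system and covers more than 159 billion tokens. Construction has four main steps: collecting all EDGAR filings, extracting text, cleaning the extracted text, and deduplicating. Each step follows best practices developed in recent years. The final clean, fraud, and deduplicated datasets contain approximately 159 billion, 30.8 million, and 111 billion tokens, respectively.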
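The cleaning and deduplication steps can be illustrated with a minimal sketch. It is not the authors' pipeline: the summary above does not say which cleaning rules or deduplication method were used, so this example assumes simple whitespace normalization and exact-match deduplication by SHA-256 hash. The function names and sample texts are hypothetical.

```python
import hashlib
import re


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace and trim the ends (illustrative cleaning only)."""
    return re.sub(r"\s+", " ", raw).strip()


def deduplicate(texts):
    """Keep the first occurrence of each cleaned text (assumed exact-match dedup)."""
    seen = set()
    unique = []
    for raw in texts:
        cleaned = clean_text(raw)
        if not cleaned:
            continue  # drop documents that are empty after cleaning
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique


# Hypothetical stand-ins for text already extracted from filings.
extracted = [
    "Revenue increased  5%\nyear over year.",
    "Revenue increased 5% year over year.",
    "   ",
    "The company repaid its term loan.",
]
deduped = deduplicate(extracted)
```

Here the first two entries differ only in whitespace, so they collapse to one document after cleaning, and the blank entry is dropped. By the token counts reported above, deduplication removes roughly 30% of the clean corpus (1 - 111/159 ≈ 0.30).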

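Features

BeanCounter has several notable characteristics. First, it is the largest public corpus of business text to date, far larger than other datasets that draw on similar sources. Second, its content is timely, factual, and high quality; every item carries a timestamp accurate to the second, which supports research on temporal effects in language models. Third, analyses of toxicity and demographic identities show a low level of toxicity: compared with web-based datasets, it contains markedly less content describing these identities.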
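Second-level timestamps make it possible to split documents at an exact point in time, for example to train on older filings and evaluate on newer ones. The sketch below shows one way to do that. The record layout and the "text" and "timestamp" field names are assumptions made for illustration and may not match the published schema.

```python
from datetime import datetime, timezone


def split_by_time(records, cutoff):
    """Split records into those strictly before the cutoff and those at or after it."""
    before = [r for r in records if r["timestamp"] < cutoff]
    after = [r for r in records if r["timestamp"] >= cutoff]
    return before, after


# Hypothetical records; field names are illustrative, not the dataset's schema.
records = [
    {"text": "Annual report text", "timestamp": datetime(2019, 3, 1, 16, 5, 42, tzinfo=timezone.utc)},
    {"text": "Quarterly report text", "timestamp": datetime(2020, 12, 31, 23, 59, 59, tzinfo=timezone.utc)},
    {"text": "Current report text", "timestamp": datetime(2021, 1, 1, 0, 0, 0, tzinfo=timezone.utc)},
]
cutoff = datetime(2021, 1, 1, tzinfo=timezone.utc)
train, held_out = split_by_time(records, cutoff)
```

With this cutoff, the filing stamped one second before midnight lands in the training split and the one stamped exactly at midnight lands in the held-out split, a distinction that day-level dates could not make.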

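Usage

BeanCounter can be used to train and fine-tune large language models, particularly for finance. Continued pre-training on BeanCounter can significantly reduce the likelihood that a model generates toxic content and improve its performance in financial applications. Its low toxicity and high quality make it well suited to training multi-billion-parameter language models, especially where reducing bias and improving factual accuracy matter.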

Background and challenges

Background overview

BeanCounter, introduced by Siyan Wang and Bradford Levy of the University of Chicago, is a dataset of more than 159 billion tokens extracted from business disclosures. Released in 2024, it targets language modeling on business-oriented text. It is distinguished by its scale, novelty, factual accuracy, and low toxicity, which makes it a useful resource for training large language models (LLMs). The dataset addresses the growing need for large, high-quality training data while mitigating the data-quality, bias, and toxicity concerns commonly associated with web-sourced datasets.

Current challenges

The development of BeanCounter faced several challenges, including the need to ensure the dataset's factual accuracy and low toxicity, which are critical for business-oriented applications. The extraction process from public domain business disclosures required meticulous cleaning and deduplication to maintain data quality. Additionally, the dataset had to be constructed in a way that minimized the inclusion of personally identifiable information and false or biased content. Another significant challenge was demonstrating the utility of BeanCounter through rigorous evaluation and comparison with existing LLMs, particularly in reducing toxic generation and improving performance within the finance domain. These challenges underscore the complexity and importance of creating a dataset that meets the stringent requirements of both academia and industry in the field of language modeling.

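Common scenarios

Typical use cases

The typical use of BeanCounter is as a low-toxicity, large-scale, open corpus of business text, particularly suited to training multi-billion-parameter large language models (LLMs). Because the data comes from public corporate disclosures, the text is more factual and less toxic than web-scraped datasets. BeanCounter is therefore often used for continual pre-training of language models, to improve their performance in the financial domain while reducing the likelihood of harmful generations.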

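Practical applications

In practice, BeanCounter is widely used for financial natural language processing tasks such as sentiment classification and named entity recognition. Its low toxicity means that models trained on it produce harmful content less often, which matters for financial applications that demand a high degree of trust. In addition, because the dataset is high quality and factual, models handle financial text more accurately and reliably, which improves the efficiency of financial analysis and decision-making.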

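Derived and related work

Researchers have built on BeanCounter in a series of related work, particularly on pre-training and fine-tuning language models for finance. For example, continual pre-training of models on BeanCounter has been reported to significantly improve performance on the Financial Phrasebank and financial named entity recognition (Fin NER) tasks. BeanCounter has also prompted further exploration of low-toxicity datasets and encouraged work on building similarly high-quality, low-toxicity datasets in other domains.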

The content above was collected and summarized by 遇见数据集.