SEC Filings Data

Indexed from arXiv on 2025-09-30
Download link:
https://huggingface.co/arcee-ai/Llama-3-SEC-Base
Description:
This dataset contains plain text extracted from U.S. Securities and Exchange Commission (SEC) filings. The raw filings were read, cleaned, filtered, and stored to provide high-quality input for model training. Processing used libraries such as boto3 (for interacting with AWS S3) and trafilatura (for text extraction), so that only suitably formatted content is included in the training data. The dataset is large and covers a large number of SEC filings. Its intended task is domain adaptation to financial regulatory data and evaluation of language model performance on that data.
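The read, clean, filter, and store steps described above could be sketched roughly as follows. This is a minimal illustration, not the providers' actual pipeline: the function names, the 200-character threshold, the `.txt` key suffix, and the bucket and key parameters are all assumptions made for the example. Only `boto3.client("s3")`, `get_object`, `put_object`, and `trafilatura.extract` are real library calls.

```python
import re
from typing import Optional


def extract_text(html: str) -> Optional[str]:
    """Extract the main text of an HTML filing; returns None if nothing is found."""
    import trafilatura  # third-party, imported lazily so the helpers below work without it
    return trafilatura.extract(html)


def clean_text(text: str) -> str:
    """Collapse every run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def keep_document(text: str, min_chars: int = 200) -> bool:
    """Hypothetical quality filter: keep only documents with enough content."""
    return len(text) >= min_chars


def process_object(bucket: str, key: str, out_bucket: str) -> bool:
    """Read one filing from S3, extract and clean it, and store it if it passes the filter."""
    import boto3  # third-party, imported lazily; needs AWS credentials at call time

    s3 = boto3.client("s3")
    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    html = raw.decode("utf-8", errors="replace")

    text = extract_text(html)
    if text is None:
        return False  # extraction found no usable text

    text = clean_text(text)
    if not keep_document(text):
        return False  # too short to be useful for training

    # Assumed output layout: same key with a .txt suffix in a separate bucket.
    s3.put_object(Bucket=out_bucket, Key=key + ".txt", Body=text.encode("utf-8"))
    return True
```

The cleaning and filtering helpers are kept free of AWS and extraction dependencies so they can be checked in isolation, for example `clean_text("  Form 10-K \n Annual   Report ")` yields `"Form 10-K Annual Report"`.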
Provider:
arcee-ai