
TeraflopAI/SEC-EDGAR

Source: Hugging Face. Updated 2026-04-17. Indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/TeraflopAI/SEC-EDGAR
Description:
---
license: apache-2.0
task_categories:
- text-generation
- text-classification
language:
- en
tags:
- finance
- edgar
- sec
size_categories:
- 1M<n<10M
---

[Datamule](https://datamule.xyz/), [Teraflop AI](https://www.teraflopai.com/), and [Eventual](https://www.eventual.ai/) collaborated to release the SEC-EDGAR dataset.

![Processing diagram](sec-edgar.png)

The dataset contains 590 GB of data, spanning 8 million samples and 43 billion tokens from all major filings in the SEC EDGAR database.

The bulk data was collected using the [datamule-python](https://github.com/john-friedman/datamule-python) library and the official [datamule API](https://datamule.xyz/) created by [John Friedman](https://john-friedman.github.io/). The datamule Python library is a package for collecting, manipulating, and processing SEC EDGAR data at scale. Datamule provides a simple open-source API for downloading each of a company's filings by ticker and submission type. SEC EDGAR limits clients to 10 requests per second. Following the official EDGAR guidance, continuously crawling 8 million major filings takes over 10 days even before network overhead is counted. The documentation for datamule can be found [here](https://john-friedman.github.io/datamule-python/).

The dataset contains the raw contents of each major filing, the plaintext extracted and parsed from its HTML/XML, and relevant metadata such as the filing's accession number, filing date, period, documents, and filer. The raw document contents are provided so that you can use your own custom parser to convert the HTML/XML to plaintext.

The text was parsed and extracted from the HTML/XML contents using the [selectolax](https://selectolax.readthedocs.io/en/latest/index.html) HTML parser and modified versions of the [doc2dict](https://github.com/john-friedman/doc2dict/tree/main) and [secsgml](https://github.com/john-friedman/secsgml) libraries.
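
Because the raw document contents ship alongside the parsed text, a custom HTML-to-plaintext pass can be swapped in. The sketch below is not the selectolax/doc2dict pipeline used for the dataset; it substitutes Python's standard-library `html.parser` so that it runs without extra dependencies. The class name `PlainTextExtractor` and the helper `html_to_text` are hypothetical, and the tag sets are illustrative assumptions rather than a complete treatment of EDGAR markup.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collect visible text, dropping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}                       # content is never emitted
    BLOCK = {"p", "div", "br", "tr", "li", "table",     # tags that force a line break
             "h1", "h2", "h3", "h4", "h5", "h6"}
    CELL = {"td", "th"}                                 # tags that force a space

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def _separator_for(self, tag):
        if tag in self.BLOCK:
            return "\n"
        if tag in self.CELL:
            return " "
        return ""

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        else:
            self._chunks.append(self._separator_for(tag))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        else:
            self._chunks.append(self._separator_for(tag))

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace inside each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(raw_html: str) -> str:
    """Hypothetical helper: convert one raw HTML document to plaintext."""
    parser = PlainTextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<body><p>Item 1.</p><table><tr><td>Revenue</td><td>100</td></tr></table></body>"
    print(html_to_text(sample))
```

This flattens each table row into a single space-separated line; it does not attempt the explicit table mappings that the dataset card attributes to doc2dict.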

The SEC SGML library is used to parse the [Standard Generalized Markup Language](https://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language) document format used by the Securities and Exchange Commission and to handle [daily archive](http://sec.gov/Archives/edgar/Feed/) and [submission file types](https://www.sec.gov/Archives/edgar/data/1318605/000095017022000796/0000950170-22-000796.txt). The doc2dict library provides multiple parsers for extracting HTML, XML, and PDF content, and was used to convert documents to plaintext and to handle table mappings explicitly. The documentation for doc2dict can be found [here](https://john-friedman.github.io/doc2dict/whitepaper/).

A total of 8 million individual filings were extracted with metadata. The document metadata contains the file type, sequence, filename, description, and number of SEC SGML bytes. The filer metadata contains the company name, Central Index Key, assigned Standard Industrial Classification codes, IRS number, state of incorporation, fiscal year, act, file number, business address, and other relevant information.

Samples per document type:

| Filing | Total number of samples |
| :---- | :---- |
| Form 5 | 114,724 |
| Form 4 | 4,474,981 |
| Form 3 | 387,465 |
| S-1 | 24,866 |
| S-8 | 95,543 |
| 10-K | 223,275 |
| 8-K | 1,952,207 |
| 20-F | 19,428 |
| 10-Q | 674,240 |
| 144 | 88,726 |
| Total | 8,055,455 |

To compute the total token count of each filing type, we used the [Comma v0.1 tokenizer](https://huggingface.co/common-pile/comma-v0.1-1t), a BPE-based tokenizer with a vocabulary size of 64,000. The dataset encompasses a total of 43 billion clean tokens for training LLMs and building retrieval pipelines.
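
The filing counts in the samples table, together with the 10 requests per second limit quoted earlier, allow a quick back-of-the-envelope check of the crawl time. The sketch below assumes exactly one request per filing, which is an assumption of this example and not something the card states; the function name `min_crawl_days` is likewise hypothetical.

```python
# Filing counts copied from the "Samples per document type" table.
SAMPLES_PER_FILING_TYPE = {
    "Form 5": 114_724,
    "Form 4": 4_474_981,
    "Form 3": 387_465,
    "S-1": 24_866,
    "S-8": 95_543,
    "10-K": 223_275,
    "8-K": 1_952_207,
    "20-F": 19_428,
    "10-Q": 674_240,
    "144": 88_726,
}

REQUESTS_PER_SECOND = 10   # EDGAR limit quoted in the dataset card
REQUESTS_PER_FILING = 1    # assumption: one request fetches one filing
SECONDS_PER_DAY = 86_400


def min_crawl_days(n_filings, requests_per_filing=REQUESTS_PER_FILING):
    """Lower bound on crawl time in days at the rate limit, ignoring latency."""
    total_requests = n_filings * requests_per_filing
    return total_requests / REQUESTS_PER_SECOND / SECONDS_PER_DAY


total_filings = sum(SAMPLES_PER_FILING_TYPE.values())
print(f"total filings: {total_filings:,}")
print(f"lower bound:   {min_crawl_days(total_filings):.1f} days")

# Share of the dataset contributed by each filing type, largest first.
for name, count in sorted(SAMPLES_PER_FILING_TYPE.items(), key=lambda kv: -kv[1]):
    print(f"{name:>7}  {count:>9,}  {count / total_filings:6.1%}")
```

Under the one-request-per-filing assumption the bound comes out at roughly 9.3 days. The card's figure of over 10 days would be consistent with some filings needing more than one request, but that is a guess rather than something the card documents. The share column also shows that Form 4 accounts for more than half of all samples.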

![Total tokens chart](total_tokens_per_filing_millions.png)

Total token counts for each filing:

| Filing | Total token count |
| :---- | :---- |
| 10-K | 14,518,876,137 |
| 20-F | 2,917,164,397 |
| Form 5 | 66,330,315 |
| Form 4 | 1,676,565,503 |
| Form 3 | 110,098,014 |
| 10-Q | 17,509,723,617 |
| S-1 | 2,914,107,827 |
| S-8 | 472,867,864 |
| 8-K | 3,466,866,649 |
| 144 | 73,218,304 |
| Total | 43,725,818,627 |

We are building open-source state-of-the-art search across numerous domains. If you would like to help support or contribute to future open-source projects and dataset releases, you can join our [Discord](https://discord.gg/bWW8Wbhxhx) or contact us directly [here](https://x.com/EnricoShippole).

You can use Teraflop AI segmentation, embedding, and search APIs today for free. Sign up for the Teraflop AI API platform [here](https://platform.teraflopai.com/signup).
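
Dividing the token totals by the sample counts from the two tables gives a rough sense of how long a typical filing of each type is. The sketch below only reuses the published numbers; the helper name `mean_tokens` is hypothetical, and the averages are derived here rather than reported by the dataset authors.

```python
# Sample counts and token totals copied from the two tables in the dataset card.
SAMPLES = {
    "Form 5": 114_724, "Form 4": 4_474_981, "Form 3": 387_465,
    "S-1": 24_866, "S-8": 95_543, "10-K": 223_275,
    "8-K": 1_952_207, "20-F": 19_428, "10-Q": 674_240, "144": 88_726,
}
TOKENS = {
    "10-K": 14_518_876_137, "20-F": 2_917_164_397, "Form 5": 66_330_315,
    "Form 4": 1_676_565_503, "Form 3": 110_098_014, "10-Q": 17_509_723_617,
    "S-1": 2_914_107_827, "S-8": 472_867_864, "8-K": 3_466_866_649,
    "144": 73_218_304,
}


def mean_tokens(filing_type):
    """Average Comma v0.1 tokens per filing for one filing type."""
    return TOKENS[filing_type] / SAMPLES[filing_type]


total_samples = sum(SAMPLES.values())
total_tokens = sum(TOKENS.values())

# Longest filing types first.
for name in sorted(SAMPLES, key=mean_tokens, reverse=True):
    print(f"{name:>7}  {mean_tokens(name):>12,.0f} tokens per filing")

print(f"overall  {total_tokens / total_samples:>12,.0f} tokens per filing")
```

The spread is wide: a 10-K averages roughly 65,000 tokens while a Form 4 averages a few hundred, so sampling uniformly by filing and sampling uniformly by token produce very different mixtures.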
Provider:
TeraflopAI