
SEC

ModelScope community, updated 2025-12-05, indexed 2025-06-21
Download link:
https://modelscope.cn/datasets/PleIAs/SEC
Resource description:
## SEC Annual Reports (Form 10-K) 1993-2024

### Dataset Overview

This dataset comprises SEC annual reports (Form 10-K) for the years 1993 to 2024, providing comprehensive coverage of publicly traded companies' financial and business information. The reports are stored in Parquet format, ensuring efficient storage and quick access. This dataset was meticulously compiled using the EDGAR-Crawler toolkit, which facilitates the extraction and processing of SEC filings from the EDGAR database.

### Dataset Structure

#### Data Files

The dataset is organized into separate Parquet files for each year, making it easy to navigate and utilize:

- 1993.parquet
- 1994.parquet
- 1995.parquet
- 1996.parquet
- 1997.parquet
- 1998.parquet
- 1999.parquet
- 2000.parquet
- 2001.parquet
- 2002.parquet
- 2003.parquet
- 2004.parquet
- 2005.parquet
- 2006.parquet
- 2007.parquet
- 2008.parquet
- 2009.parquet
- 2010.parquet
- 2011.parquet
- 2012.parquet
- 2013.parquet
- 2014.parquet
- 2015.parquet
- 2016.parquet
- 2017.parquet
- 2018.parquet
- 2019.parquet
- 2020.parquet
- 2021.parquet
- 2022.parquet
- 2023.parquet
- 2024.parquet

### Summary Statistics

Across these years, the dataset encapsulates a total of 7,245,966,226 words spread over 245,211 entries, with an average of 34,324.36 words per entry. Notably, there are 4,043 documents with zero words, reflecting the occasional nature of filings that contain no textual content.

### Fields Included

Each Parquet file contains detailed fields that provide a comprehensive view of each report:

- `filename`: The filename of the filing document (e.g., "1089297_21929025_2004.htm").
- `id`: A unique identifier for the filing, formatted as "cik_year" (e.g., "1089297_2021").
- `year`: The year of the filing.
- `cik`: The Central Index Key assigned to the company (e.g., "1089297").
- `text`: The full text of the filing.
- `word_count`: The number of words in the filing text.
- `character_count`: The number of characters in the filing text.
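The per-year file layout and the "cik_year" `id` format described above lend themselves to a couple of small helpers. The following is a minimal sketch, not part of the dataset's own tooling: the names `year_file_name`, `FilingId`, `parse_filing_id`, and `load_year`, as well as the local directory layout, are hypothetical. `load_year` assumes pandas and a Parquet engine such as pyarrow are installed.

```python
from dataclasses import dataclass
from pathlib import Path

FIRST_YEAR, LAST_YEAR = 1993, 2024  # year range stated in the dataset card


def year_file_name(year: int) -> str:
    """Return the per-year Parquet file name, e.g. '2004.parquet'."""
    if not FIRST_YEAR <= year <= LAST_YEAR:
        raise ValueError(f"year {year} is outside {FIRST_YEAR}-{LAST_YEAR}")
    return f"{year}.parquet"


@dataclass(frozen=True)
class FilingId:
    cik: str   # kept as a string, as in the card's examples
    year: int


def parse_filing_id(filing_id: str) -> FilingId:
    """Split an `id` of the form 'cik_year' into its two parts."""
    cik, sep, year = filing_id.rpartition("_")
    if not sep or not cik.isdigit() or not year.isdigit():
        raise ValueError(f"unexpected id format: {filing_id!r}")
    return FilingId(cik=cik, year=int(year))


def load_year(year: int, data_dir: str = "."):
    """Load one year's filings into a DataFrame (hypothetical local layout)."""
    import pandas as pd  # imported lazily so the helpers above stay stdlib-only

    path = Path(data_dir) / year_file_name(year)
    return pd.read_parquet(path)
```

With the files downloaded to a local folder, a call such as `load_year(2021, "SEC")` would return that year's rows, and `parse_filing_id` can be applied to the `id` column to recover the CIK and year separately.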
### Source and Methodology

#### Source

- Until 2020: The data have been collected from https://zenodo.org/records/5528490.
- From 2021 to 2024: The data were collected using the EDGAR-Crawler toolkit, which facilitated the extraction and processing of SEC filings from the EDGAR database.

#### Methodology

1. **Crawling**: The EDGAR-Crawler toolkit was utilized to download the 10-K filings for each specified year.
2. **Extraction and Cleaning**: The filings were extracted and cleaned to ensure a structured and clean dataset.
3. **Integration**: This dataset is seamlessly integrated with existing datasets from 1993 to 2020, providing a continuous and comprehensive record of SEC annual reports for extensive research and analysis.

### Use Cases

This dataset is invaluable for various applications, including but not limited to:

- **Academic Research**: Researchers in economics, finance, and business management can leverage this dataset to conduct detailed and expansive analyses, enhancing the scope and depth of their studies with robust financial data.
- **Financial Analysis**: Professionals in finance can utilize the detailed reports to bolster financial analysis, strategic planning, and decision-making processes, ensuring well-informed and data-driven insights.
- **NLP Applications**: The structured textual data in this dataset supports natural language processing (NLP) research and applications, enabling the development of advanced models and tools for financial document analysis and more.
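To picture the integration step in the methodology above, the sketch below assembles one row with the documented fields and tags which upstream source a given year comes from. It is an illustration only, not the maintainers' pipeline: `source_for_year` and `build_record` are hypothetical names, the source labels are invented for this example, and counting words as whitespace-separated tokens is an assumption, since the card does not say how `word_count` was computed.

```python
LAST_ZENODO_YEAR = 2020  # per the card: data until 2020 come from the Zenodo record


def source_for_year(year: int) -> str:
    """Name the upstream source the card gives for a filing year.

    The labels 'zenodo' and 'edgar-crawler' are made up for this sketch.
    """
    return "zenodo" if year <= LAST_ZENODO_YEAR else "edgar-crawler"


def build_record(cik: str, year: int, filename: str, text: str) -> dict:
    """Assemble one row with the fields listed in the dataset card."""
    return {
        "filename": filename,
        "id": f"{cik}_{year}",  # documented format: "cik_year"
        "year": year,
        "cik": cik,
        "text": text,
        # Assumption: words are whitespace-separated tokens.
        "word_count": len(text.split()),
        "character_count": len(text),
    }
```

Producing rows of the same shape for both sources is what would let the 1993-2020 and 2021-2024 data sit in identically structured per-year files.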
### General Dataset Statistics

- **Total number of words**: 7,245,966,226
- **Total number of entries**: 245,211
- **Average number of words per entry**: 34,324.36
- **Number of zero-word documents**: 4,043

### Dataset Citation

If you utilize this dataset in your research, please cite it as follows:

```
@dataset{SecAnnual,
  title={SEC Annual Reports (Form 10-K) 1993-2024},
  author={Pleias},
  year={2024},
  description={Collection of SEC annual reports (Form 10-K) for the years 1993 to 2024}
}
```

**Note:** This dataset is presented and maintained by Pleias. All rights reserved.
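The general statistics listed earlier can be recomputed from the `word_count` column. The sketch below shows one way to do so on a plain list of counts; `summarize` is a hypothetical helper, not something shipped with the dataset. Note that dividing the published total word count by the published number of entries gives roughly 29,550 words per entry rather than 34,324.36, so the published average was presumably computed on a different basis; the card does not say which.

```python
def summarize(word_counts: list[int]) -> dict:
    """Compute the four headline statistics from per-document word counts."""
    total = sum(word_counts)
    entries = len(word_counts)
    return {
        "total_words": total,
        "entries": entries,
        "mean_words": total / entries if entries else 0.0,
        "zero_word_docs": sum(1 for n in word_counts if n == 0),
    }


# Totals exactly as published in the dataset card.
PUBLISHED_TOTAL_WORDS = 7_245_966_226
PUBLISHED_ENTRIES = 245_211

# Plain ratio of the two published totals, for comparison with the stated average.
ratio_of_published_totals = PUBLISHED_TOTAL_WORDS / PUBLISHED_ENTRIES
```

In practice `summarize` would be fed the concatenated `word_count` values from all yearly files, which makes it straightforward to check each published figure against the data itself.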

Provider:
maas
Created:
2025-06-19