eddmpython/dartlab-data

Name: eddmpython/dartlab-data
Creator: eddmpython
Published: 2026-04-11 19:28:05
License: MIT

Hugging Face · Updated 2026-04-11 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/eddmpython/dartlab-data


Resource description:

---
license: mit
task_categories:
- table-question-answering
- text-classification
language:
- ko
- en
tags:
- finance
- disclosure
- dart
- edgar
- sec
- xbrl
- korea
- financial-statements
- corporate-filings
- 전자공시
- 재무제표
- 사업보고서
- 한국
pretty_name: DartLab 전자공시 데이터
size_categories:
- 1K<n<10K
---

<div align="center">
<br>
<img alt="DartLab" src="https://huggingface.co/datasets/eddmpython/dartlab-data/resolve/main/assets/logo.png" width="160">
<h3>DartLab Data</h3>
<p><b>Structured company data from DART & EDGAR disclosure filings</b></p>
<p>
<a href="https://github.com/eddmpython/dartlab"><img src="https://img.shields.io/badge/GitHub-dartlab-ea4647?style=for-the-badge&labelColor=050811&logo=github&logoColor=white" alt="GitHub"></a>
<a href="https://pypi.org/project/dartlab/"><img src="https://img.shields.io/pypi/v/dartlab?style=for-the-badge&color=ea4647&labelColor=050811&logo=pypi&logoColor=white" alt="PyPI"></a>
<a href="https://eddmpython.github.io/dartlab/"><img src="https://img.shields.io/badge/Docs-GitHub_Pages-38bdf8?style=for-the-badge&labelColor=050811&logo=github-pages&logoColor=white" alt="Docs"></a>
<a href="https://buymeacoffee.com/eddmpython"><img src="https://img.shields.io/badge/Sponsor-Buy_Me_A_Coffee-ffdd00?style=for-the-badge&labelColor=050811&logo=buy-me-a-coffee&logoColor=white" alt="Sponsor"></a>
</p>
</div>

## What is this?

<img align="right" src="https://huggingface.co/datasets/eddmpython/dartlab-data/resolve/main/assets/avatar-study.png" width="120">

Pre-collected [Parquet](https://parquet.apache.org/) files from [DartLab](https://github.com/eddmpython/dartlab), a Python library that turns DART (Korea) and EDGAR (US) disclosure filings into one structured company map.

This dataset is the **data layer** behind DartLab. When you run `dartlab.Company("005930")`, the library automatically downloads the relevant parquet from this repo.
## Dataset Structure

```
dart/
├── docs/      2,547 companies   ~8 GB     disclosure text (sections, tables, markdown)
├── finance/   2,744 companies   ~586 MB   financial statements (BS, IS, CF, XBRL)
└── report/    2,711 companies   ~319 MB   structured disclosure APIs (28 types)
```

Each file is one company: `{stockCode}.parquet`

### docs: Disclosure Text

Full-text sections from annual/quarterly reports, parsed into structured blocks.

| Column | Description |
|--------|------------|
| `rcept_no` | DART filing ID |
| `rcept_date` | Filing date |
| `stock_code` | Stock code |
| `corp_name` | Company name |
| `report_type` | Annual/quarterly report type |
| `section_title` | Original section title |
| `section_order` | Section ordering |
| `content` | Section text (markdown) |
| `blockType` | `text` / `table` / `heading` |
| `year` | Filing year |

### finance: Financial Statements

XBRL-based financial data from DART OpenAPI (`fnlttSinglAcntAll`).

| Column | Description |
|--------|------------|
| `bsns_year` | Business year |
| `reprt_code` | Report quarter code |
| `stock_code` | Stock code |
| `corp_name` | Company name |
| `fs_div` | `CFS` (consolidated) / `OFS` (separate) |
| `sj_div` | Statement type (BS/IS/CF/SCE) |
| `account_id` | XBRL account ID |
| `account_nm` | Account name (Korean) |
| `thstrm_amount` | Current period amount |
| `frmtrm_amount` | Prior period amount |
| `bfefrmtrm_amount` | Two periods prior amount |

### report: Structured Disclosure APIs

28 DART API categories covering governance, compensation, shareholding, and more.

| Column | Description |
|--------|------------|
| `apiType` | API category (e.g., `dividend`, `employee`, `executive`) |
| `year` | Year |
| `quarter` | Quarter |
| `stockCode` | Stock code |
| `corpCode` | DART corp code |
| *(varies)* | Category-specific columns |

**28 API types:** dividend, employee, executive, majorHolder, treasuryStock, capitalChange, auditOpinion, stockTotal, outsideDirector, corporateBond, and more.
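To make the finance schema concrete, here is a minimal sketch that selects the consolidated balance-sheet rows for one business year. It works on plain Python dictionaries keyed by the column names in the finance table. The sample rows, the `account_id` values, the helper names `to_number` and `select_rows`, and the assumption that amounts may arrive as comma-formatted strings are illustrative only and are not taken from DartLab.

```python
from typing import Optional


def to_number(raw) -> Optional[int]:
    """Convert an amount cell to int; return None for blanks or dashes."""
    if raw is None:
        return None
    if isinstance(raw, (int, float)):
        return int(raw)
    text = str(raw).replace(",", "").strip()
    if text in ("", "-"):
        return None
    return int(text)


def select_rows(rows, year, fs_div="CFS", sj_div="BS"):
    """Keep rows matching one business year, statement scope, and statement type."""
    return [
        r for r in rows
        if str(r["bsns_year"]) == str(year)
        and r["fs_div"] == fs_div
        and r["sj_div"] == sj_div
    ]


# Made-up sample rows using the documented column names (values are not real data).
sample = [
    {"bsns_year": "2023", "fs_div": "CFS", "sj_div": "BS",
     "account_id": "ifrs-full_Assets", "thstrm_amount": "1,000"},
    {"bsns_year": "2023", "fs_div": "OFS", "sj_div": "BS",
     "account_id": "ifrs-full_Assets", "thstrm_amount": "800"},
    {"bsns_year": "2023", "fs_div": "CFS", "sj_div": "IS",
     "account_id": "ifrs-full_Revenue", "thstrm_amount": "500"},
]

# Map each account ID to its current-period amount for the consolidated balance sheet.
balance_sheet = {
    r["account_id"]: to_number(r["thstrm_amount"])
    for r in select_rows(sample, 2023)
}
```

If the Parquet file is loaded into a dataframe instead, the same three equality filters on `bsns_year`, `fs_div`, and `sj_div` apply; the actual column dtypes should be checked before parsing amounts.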
## Usage

<img align="right" src="https://huggingface.co/datasets/eddmpython/dartlab-data/resolve/main/assets/avatar-analyze.png" width="120">

### With DartLab (recommended)

```bash
pip install dartlab
```

```python
import dartlab

c = dartlab.Company("005930")   # Samsung Electronics
c.sections                      # full company map (topic x period)
c.BS                            # balance sheet
c.ratios                        # financial ratios
c.show("businessOverview")      # narrative text

# US companies work the same way
us = dartlab.Company("AAPL")
us.BS
us.ratios
```

DartLab auto-downloads from this dataset. No manual download needed.

### Direct download

```python
import polars as pl

# Single file
url = "https://huggingface.co/datasets/eddmpython/dartlab-data/resolve/main/dart/finance/005930.parquet"
df = pl.read_parquet(url)
```

```bash
# wget
wget https://huggingface.co/datasets/eddmpython/dartlab-data/resolve/main/dart/finance/005930.parquet
```

<img align="right" src="https://huggingface.co/datasets/eddmpython/dartlab-data/resolve/main/assets/avatar-discover.png" width="120">

## Data Source

- **DART** (Korea): [dart.fss.or.kr](https://dart.fss.or.kr), Korea's electronic disclosure system operated by the Financial Supervisory Service
- **EDGAR** (US): [sec.gov/edgar](https://www.sec.gov/edgar), the SEC's Electronic Data Gathering, Analysis, and Retrieval system

All data is sourced from public government disclosure systems. Financial figures are preserved as-is from the original filings: no rounding, no estimation, no interpolation.

## Update Schedule

This dataset is updated automatically via GitHub Actions (daily). Recent filings (last 7 days) are checked and collected incrementally.

## License

MIT, same as [DartLab](https://github.com/eddmpython/dartlab).
## Support

If DartLab is useful for your work, consider supporting the project:

[![Buy Me A Coffee](https://img.shields.io/badge/Buy_Me_A_Coffee-Support-ffdd00?style=for-the-badge&labelColor=050811&logo=buy-me-a-coffee&logoColor=white)](https://buymeacoffee.com/eddmpython)

- [GitHub Issues](https://github.com/eddmpython/dartlab/issues): bug reports, feature requests
- [Blog](https://eddmpython.github.io/dartlab/blog/): 120+ articles on Korean disclosure analysis

Provider:

eddmpython

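Summary

Dataset Introduction

Construction

The DartLab dataset is built by systematically collecting public data from two official disclosure platforms: DART in Korea and EDGAR in the United States. An automated GitHub Actions workflow runs daily and incrementally collects company filings from the last seven days, which keeps the data current and continuous. The data is stored as Parquet, one file per company, and covers three modules: disclosure text, financial statements, and structured API data. Financial figures are preserved exactly as filed, with no rounding or estimation.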
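The incremental update described here can be pictured with a short sketch. This is not the project's actual GitHub Actions job, only an illustration under stated assumptions: filing dates are assumed to be `YYYYMMDD` or `YYYY-MM-DD` strings, and the functions `parse_rcept_date`, `recent_filings`, and `target_file` are hypothetical names. Only the `rcept_no`, `rcept_date`, and `stock_code` columns and the `{stockCode}.parquet` layout come from the dataset card.

```python
from datetime import date, datetime, timedelta


def parse_rcept_date(raw: str) -> date:
    """Parse a filing date, accepting YYYYMMDD or YYYY-MM-DD (assumed formats)."""
    fmt = "%Y%m%d" if len(raw) == 8 else "%Y-%m-%d"
    return datetime.strptime(raw, fmt).date()


def recent_filings(filings, today: date, window_days: int = 7):
    """Return filings dated no more than `window_days` days before `today`."""
    cutoff = today - timedelta(days=window_days)
    return [
        f for f in filings
        if cutoff <= parse_rcept_date(f["rcept_date"]) <= today
    ]


def target_file(category: str, stock_code: str) -> str:
    """Build the per-company path, following the {stockCode}.parquet layout."""
    return f"dart/{category}/{stock_code}.parquet"


# Made-up filings; only the column names come from the dataset card.
filings = [
    {"rcept_no": "A", "rcept_date": "20260410", "stock_code": "005930"},
    {"rcept_no": "B", "rcept_date": "20260301", "stock_code": "000660"},
    {"rcept_no": "C", "rcept_date": "2026-04-05", "stock_code": "005930"},
]

fresh = recent_filings(filings, today=date(2026, 4, 11))
# Each company file only needs rewriting once, however many new filings it has.
files_to_update = sorted({target_file("docs", f["stock_code"]) for f in fresh})
```

Deduplicating by target file reflects the one-file-per-company layout: several new filings for the same company touch a single Parquet file.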

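Usage

Researchers can pass a company's stock code to the DartLab Python library, which downloads and loads the matching Parquet file automatically. The dataset suits table question answering, text classification, and financial analysis. The docs module supports natural language processing, the finance module supports financial ratio calculations, and the report module's API categories support research on corporate governance and ownership structure. Because the data is stored in a standard format, it integrates with mainstream data analysis tools and supports work ranging from macro trend analysis to company-level study.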
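As one illustration of the docs module feeding a text-classification task, the sketch below turns disclosure blocks into `(text, label)` pairs. Using `section_title` as the label and dropping very short blocks are assumptions made for the example, and `build_examples` is a hypothetical helper, not part of DartLab. The rows are invented and only reuse the documented column names.

```python
from collections import Counter


def build_examples(rows, min_chars: int = 20):
    """Turn text blocks into (text, label) pairs, using section_title as the label."""
    examples = []
    for row in sorted(rows, key=lambda r: r["section_order"]):
        if row["blockType"] != "text":
            continue  # skip tables and headings
        text = row["content"].strip()
        if len(text) >= min_chars:
            examples.append((text, row["section_title"]))
    return examples


# Invented rows that mimic the docs schema (not real filing text).
rows = [
    {"section_order": 2, "blockType": "text", "section_title": "Business Overview",
     "content": "The company manufactures semiconductors and displays."},
    {"section_order": 1, "blockType": "heading", "section_title": "Company Overview",
     "content": "I. Company Overview"},
    {"section_order": 3, "blockType": "table", "section_title": "Financial Matters",
     "content": "| a | b |"},
    {"section_order": 1, "blockType": "text", "section_title": "Company Overview",
     "content": "Example text about the company's history and listing."},
    {"section_order": 4, "blockType": "text", "section_title": "Other",
     "content": "N/A"},
]

examples = build_examples(rows)
label_counts = Counter(label for _, label in examples)
```

Checking `label_counts` before training is a cheap way to spot sections that are too rare to serve as classes.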

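Background and Challenges

Background Overview

Structuring corporate electronic disclosure data is a long-standing problem at the intersection of financial informatics and natural language processing. The DartLab dataset is built by the team behind the open-source DartLab project. It integrates public corporate filings from Korea's DART system and the US SEC's EDGAR system into a unified, structured company map. The dataset covers financial reports, disclosure text, and structured API data for about 2,700 Korean companies and 970 US companies. Its core research question is how to automatically extract and standardize financial and governance information from heterogeneous, multilingual official filings, in support of cross-market corporate finance analysis, risk modeling, and question-answering systems. By providing a Parquet data layer, it lowers the technical barrier to financial data analysis and gives academia and industry a multilingual corporate data benchmark.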

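Current Challenges

The dataset targets table question answering and text classification over corporate disclosure documents. The core domain challenge is handling the complex semantics and multilingual differences of financial text, for example parsing XBRL financial tags accurately, understanding numeric relationships in unstructured tables, and aligning terminology between Korean and English. Construction raises several engineering challenges as well. The original filings come in heterogeneous formats, so robust parsers are needed to extract text, tables, and metadata. The data is large and continuously updated, which requires an efficient automated pipeline for collection and incremental updates. Keeping financial metrics and report structures from DART and EDGAR consistent and accurate also takes careful engineering.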

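Common Scenarios

Typical Use Cases

The DartLab dataset gives researchers structured corporate disclosure documents, and its typical use is supporting table question answering and text classification. By integrating public data from Korea's DART and the US EDGAR system, it lets models learn to extract key information from complex financial reports, such as identifying specific line items in financial statements or automatically categorizing disclosure sections, which lays a foundation for automated financial document analysis.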

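Academic Problems Addressed

The dataset addresses two academic challenges in financial natural language processing: data scarcity and structural heterogeneity. With standardized XBRL financial data and parsed text blocks, researchers can study how the language of corporate disclosures relates to financial metrics. This supports research on information extraction, cross-lingual financial text understanding, and automated auditing methods, and can improve the accuracy and efficiency of financial data analysis.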

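Practical Applications

In practice, the DartLab dataset can serve investment analysis, risk management, and corporate supervision. Analysts can use the structured financial data and disclosure text to build a company's financial profile quickly, monitor governance changes, or run peer comparisons. Regulators can use automated tools to screen large volumes of filings for anomalies, which strengthens market transparency and compliance oversight.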
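A peer comparison of the kind mentioned here can be sketched from the `thstrm_amount` (current period) and `frmtrm_amount` (prior period) columns of the finance data. The example assumes amounts have already been converted to numbers; the company names, figures, `account_id` value, and the helpers `yoy_change` and `compare_peers` are hypothetical.

```python
def yoy_change(current, prior):
    """Return (current - prior) / |prior|, or None when it cannot be computed."""
    if current is None or prior in (None, 0):
        return None
    return (current - prior) / abs(prior)


def compare_peers(rows, account_id):
    """Map each company to the year-over-year change of one account."""
    result = {}
    for r in rows:
        if r["account_id"] == account_id:
            result[r["corp_name"]] = yoy_change(
                r["thstrm_amount"], r["frmtrm_amount"]
            )
    return result


# Invented figures for three placeholder companies.
rows = [
    {"corp_name": "Company A", "account_id": "ifrs-full_Revenue",
     "thstrm_amount": 120, "frmtrm_amount": 100},
    {"corp_name": "Company B", "account_id": "ifrs-full_Revenue",
     "thstrm_amount": 90, "frmtrm_amount": 100},
    {"corp_name": "Company C", "account_id": "ifrs-full_Revenue",
     "thstrm_amount": 50, "frmtrm_amount": None},
    {"corp_name": "Company A", "account_id": "ifrs-full_Assets",
     "thstrm_amount": 500, "frmtrm_amount": 400},
]

changes = compare_peers(rows, "ifrs-full_Revenue")
# Rank companies with a computable change, fastest growth first.
ranked = sorted(
    (name for name, value in changes.items() if value is not None),
    key=lambda name: changes[name],
    reverse=True,
)
```

Returning `None` instead of raising on a missing or zero prior amount keeps companies with incomplete history in the result, so they can be reported separately rather than silently dropped.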

Recent Research on the Dataset