
A structured database of Indonesian corporate financial statements with section-level and line-item embeddings: coal-mining and palm-oil plantation companies (2023)

Mendeley Data, indexed 2026-07-04
Download link:
https://data.mendeley.com/datasets/ys3zy3yw5y
Description:
This data article presents a structured database derived from the audited annual financial statements of 60 Indonesian listed companies for the 2023 financial year, comprising 30 coal-mining and 30 palm-oil plantation firms on the Indonesia Stock Exchange (IDX). Each PDF report was parsed into a relational schema that separates document metadata, the hierarchy of note sections (Catatan atas Laporan Keuangan / CaLK), the full text of each note section, and the individual numerical line items of the balance sheet and income statement. The database contains 60 document records, 4,896 note-section records, 4,896 section-content records, and 8,752 financial line-item records. Every textual and numerical unit is accompanied by a 768-dimensional dense vector embedding computed with IndoBERT, enabling semantic search over both narrative disclosures and financial line items in Bahasa Indonesia. The data are released as relational tables (CSV/SQL) together with the parsing scripts and the field dictionary. Because the database preserves the parent-child relationship between notes, sub-notes, and the figures they explain, it provides a reusable resource for research on retrieval-augmented generation, table-text reasoning, financial information extraction, and disclosure analysis on semi-structured financial statements in an under-represented, non-English setting. It can be reused to build and benchmark document-analysis systems, to study disclosure patterns across the two sectors, or to link its records with firm-level market or tax data.
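To make the description concrete, the following is a minimal sketch of how a parent-child note hierarchy and embedding-based ranking could be queried. All table names, column names, and rows are hypothetical stand-ins invented for illustration; the actual field dictionary shipped with the dataset may differ. The toy vectors have 3 dimensions in place of the 768-dimensional IndoBERT embeddings, and cosine similarity is assumed as the ranking measure.

```python
import math
import sqlite3

# Hypothetical schema: names are illustrative, not taken from the dataset.
SCHEMA = """
CREATE TABLE documents (
    doc_id INTEGER PRIMARY KEY, ticker TEXT, sector TEXT, fiscal_year INTEGER
);
CREATE TABLE note_sections (
    section_id INTEGER PRIMARY KEY,
    doc_id INTEGER REFERENCES documents(doc_id),
    parent_id INTEGER REFERENCES note_sections(section_id),
    title TEXT
);
CREATE TABLE line_items (
    item_id INTEGER PRIMARY KEY,
    doc_id INTEGER REFERENCES documents(doc_id),
    section_id INTEGER REFERENCES note_sections(section_id),
    label TEXT,
    amount REAL
);
"""


def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO documents VALUES (1, 'DEMO', 'coal-mining', 2023)")
    conn.executemany(
        "INSERT INTO note_sections VALUES (?, ?, ?, ?)",
        [
            (1, 1, None, "Aset tetap"),   # top-level note
            (2, 1, 1, "Penyusutan"),      # sub-note of section 1
            (3, 1, None, "Persediaan"),   # another top-level note
        ],
    )
    conn.execute("INSERT INTO line_items VALUES (1, 1, 2, 'Beban penyusutan', 100.0)")
    conn.commit()
    return conn


def note_path(conn, section_id):
    """Return note titles from the top-level note down to the given section."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(section_id, parent_id, title, depth) AS (
            SELECT section_id, parent_id, title, 0
            FROM note_sections WHERE section_id = ?
            UNION ALL
            SELECT n.section_id, n.parent_id, n.title, c.depth + 1
            FROM note_sections n JOIN chain c ON n.section_id = c.parent_id
        )
        SELECT title FROM chain ORDER BY depth DESC
        """,
        (section_id,),
    ).fetchall()
    return [row[0] for row in rows]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_similarity(query_vec, candidates):
    """Sort (id, vector) pairs by descending cosine similarity to the query."""
    scored = [(cosine(query_vec, vec), cid) for cid, vec in candidates]
    return [cid for _, cid in sorted(scored, key=lambda t: t[0], reverse=True)]
```

In this sketch, `note_path(conn, 2)` walks from a sub-note up to its parent note with a recursive query, which is the kind of traversal the preserved parent-child relationship makes possible. `rank_by_similarity` would be applied to a query embedding and the stored section or line-item embeddings to retrieve the closest matches.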
Created:
2026-06-22