Financial fraud dataset of Chinese listed companies (2015-2020)
Source: DataCite Commons | Updated: 2025-04-27 | Indexed: 2025-05-18
Download link:
https://www.scidb.cn/detail?dataSetId=b3579289c15c4a7f840f2279d3cd4574
Description:
The data is sourced from the CSMAR database, covering violation records of companies listed on the Shanghai and Shenzhen stock exchanges from 2015 to 2020, focusing on five types of financial fraud: fictitious profits, inflated assets, false records, material omissions, and inaccurate disclosures. After excluding financial firms, the fraud sample set includes 2,652 violation records from 1,226 companies. Additionally, 2,652 high-quality companies without fraud were selected from the CNRDS ESG rating database to form the non-fraud sample set. The dataset consists of two parts: 1) Structured data: The file "financial fraud dataset (structured data).xlsx" contains 5,304 records covering 43 fields, such as basic company information, financial indicators, structural indicators, and linguistic features of annual report texts. Field names are listed in Table 1. 2) Annual report text data: The folder named "Annual report text data" includes 2,652 fraud samples (file names formatted as Symbol-Year.txt) and 2,652 non-fraud samples (same format). The files contain the MD&A sections of listed companies' annual reports.
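The file naming convention described above (Symbol-Year.txt) lends itself to a small indexing helper. The following is a minimal sketch, not part of the dataset's own tooling: the names `ReportRef`, `parse_report_name`, and `iter_reports` are hypothetical, and the regular expression assumes a four-digit year and a symbol containing no hyphen. How the fraud and non-fraud files are arranged inside the "Annual report text data" folder is not stated in the listing, so the sketch simply walks the folder recursively.

```python
import re
from pathlib import Path
from typing import Iterator, NamedTuple, Optional

# Assumed pattern for "Symbol-Year.txt": any hyphen-free symbol followed by
# a four-digit year. The listing does not specify the exact symbol format.
FILENAME_RE = re.compile(r"(?P<symbol>[^-]+)-(?P<year>\d{4})\.txt")


class ReportRef(NamedTuple):
    """One MD&A text file, identified by company symbol and report year."""
    symbol: str  # kept as a string so leading zeros are preserved
    year: int
    path: Path


def parse_report_name(path: Path) -> Optional[ReportRef]:
    """Return a ReportRef if the file name matches Symbol-Year.txt, else None."""
    match = FILENAME_RE.fullmatch(path.name)
    if match is None:
        return None
    return ReportRef(match["symbol"], int(match["year"]), path)


def iter_reports(folder: Path) -> Iterator[ReportRef]:
    """Yield every matching report file under folder, in sorted path order."""
    for path in sorted(folder.rglob("*.txt")):
        ref = parse_report_name(path)
        if ref is not None:
            yield ref
```

A typical use would be `refs = list(iter_reports(Path("Annual report text data")))`, after which each `ref.path.read_text(encoding="utf-8")` returns one MD&A section. The UTF-8 encoding is also an assumption and may need adjusting for the actual files.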
Provider:
Science Data Bank
Created:
2025-04-17
Summary
Dataset overview

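Background and challenges
Background overview
This dataset contains financial fraud data for Chinese companies listed on the Shanghai and Shenzhen stock exchanges from 2015 to 2020. It covers five types of fraud and is split into two parts: structured data and annual report text. The structured data comprises 5,304 records with 43 fields; the text data comprises annual report texts for 2,652 fraud samples and 2,652 non-fraud samples.
The content above was compiled and summarized by 遇见数据集.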



