WaguyMZ/Financial_statements_fraud_dataset
Hugging Face. Updated 2025-11-28; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/WaguyMZ/Financial_statements_fraud_dataset
Description:
---
license: gpl
task_categories:
- text-classification
language:
- en
tags:
- finance
size_categories:
- 100M<n<1B
---
Official dataset of the paper: [Read Between the Lines: A Robust Financial Statement Fraud Detection Framework](https://hal.science/file/index/docid/5375997/filename/anoymous-submission-with-appendices.pdf)
**Guy Stephane Waffo Dzuyo¹², Gael Guibon²³, Christophe Cerisara², Luis Belmar-Letelier¹**
¹ Forvis Mazars
² LORIA, CNRS, Université de Lorraine
³ Université Sorbonne Paris Nord, CNRS, Laboratoire d’Informatique de Paris Nord, LIPN, F-93430 Villetaneuse, France
**Emails:**
guy.stephane.waffo@forvismazars.com
gael.guibon@lipn.fr
christophe.cerisara@loria.fr
luis.belmar-letelier@forvismazars.com
Main Purpose of the Dataset: Supervised Anomaly Detection Task

The preprocessed dataset provided here includes:
- 17,863 quarterly MD&A reports, summarized with a self-hosted Qwen3 32B model
- 3,300 AAER reports
- 269,097 quarterly financial reports
- Final ready-to-use preprocessed datasets with distinct splitting strategies:
  * Random splitting
  * Company-isolated splitting: our paper demonstrates that the company-isolated setting is a more rigorous framework for the financial statement fraud detection task.
  * Time splitting

Each dataset comprises 5 folds, and each fold comes with its own *train.csv* and *test.csv*.
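
To illustrate what "company-isolated" means in practice, here is a minimal sketch that assigns rows to 5 folds so that every report from a given company lands in the same fold. It is not the procedure used to build the released folds, which are already split. The function names, the toy company identifiers, and the round-robin assignment are assumptions made for illustration only.

```python
import random


def company_isolated_folds(company_ids, n_folds=5, seed=0):
    """Return one fold index per row; all rows of a company share a fold."""
    companies = sorted(set(company_ids))
    rng = random.Random(seed)
    rng.shuffle(companies)
    # Round-robin over the shuffled companies keeps fold sizes balanced.
    fold_of_company = {c: i % n_folds for i, c in enumerate(companies)}
    return [fold_of_company[c] for c in company_ids]


def train_test_indices(fold_ids, test_fold):
    """Split row indices into train/test for the chosen held-out fold."""
    train = [i for i, f in enumerate(fold_ids) if f != test_fold]
    test = [i for i, f in enumerate(fold_ids) if f == test_fold]
    return train, test


# Toy data (hypothetical): 10 companies with 4 quarterly reports each.
company_ids = [f"C{c:02d}" for c in range(10) for _ in range(4)]
fold_ids = company_isolated_folds(company_ids, n_folds=5, seed=0)

for k in range(5):
    train_idx, test_idx = train_test_indices(fold_ids, k)
    train_companies = {company_ids[i] for i in train_idx}
    test_companies = {company_ids[i] for i in test_idx}
    # No company appears on both sides of the split.
    assert train_companies.isdisjoint(test_companies)
```

Under random splitting, by contrast, reports from the same company can appear in both train and test, so a model may score well by recognizing the company instead of the fraud pattern.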
If you need the entire raw dataset, please contact us at guywaffo@gmail.com.
If you want to contribute to improving the dataset, feel free to open a thread in the `Community` section so we can discuss it.
Provider:
WaguyMZ



