VynFi/vynfi-journal-entries-1m
Source: Hugging Face. Updated 2026-04-14; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/VynFi/vynfi-journal-entries-1m
Description:
---
license: apache-2.0
task_categories:
- tabular-classification
tags:
- synthetic
- financial-data
- vynfi
- journal-entries
- audit-analytics
- benford
- fraud-detection
size_categories:
- 1M<n<10M
---
# VynFi Journal Entries: 2.1M Line Items with Fraud Labels
The dataset contains 2,106,112 journal entry line items from 200,384 documents, covering 10 manufacturing-sector companies over 12 monthly periods.
Each row is a single debit or credit posting with the parent document's header fields merged in. 40 columns remain after sparse fields are dropped. Key properties:
- 6.92% fraud rate (revenue fraud, vendor kickback, payroll ghost, management override)
- 21% manual entries (ISA 240 relevant)
- Double-entry: total debits and credits are within 0.25% of each other across the full dataset
- GL account codes follow a standard manufacturing chart of accounts
- Native float amounts (no string conversion needed)
This is not a curated research dataset. It is a raw generation output with known limitations: some header fields are sparse (approval workflows, SOD conflicts), OCPM case IDs are mostly null, and the debit/credit imbalance reflects anomaly-injected fraud entries that are intentionally unbalanced.
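The balance property described above can be checked directly. The following is a minimal sketch, assuming that `line_document_id` identifies the parent document of each line (the column name suggests this, but the card does not confirm it) and that a null amount means zero. The helper `balance_report`, the tolerance `tol`, and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def balance_report(df: pd.DataFrame, tol: float = 0.005) -> dict:
    """Summarise debit/credit balance overall and per document.

    Assumes `line_document_id` groups the lines of one journal entry
    and that a missing amount means zero (both are assumptions).
    """
    debit = df["line_debit_amount"].fillna(0.0)
    credit = df["line_credit_amount"].fillna(0.0)

    total_debit = float(debit.sum())
    total_credit = float(credit.sum())

    # Relative gap between total debits and total credits.
    denom = max(total_debit, total_credit)
    rel_gap = abs(total_debit - total_credit) / denom if denom else 0.0

    # Net amount per document; a balanced entry nets to (about) zero.
    net = (debit - credit).groupby(df["line_document_id"]).sum()
    unbalanced = net[net.abs() > tol]

    return {
        "total_debit": total_debit,
        "total_credit": total_credit,
        "relative_gap": rel_gap,
        "unbalanced_documents": sorted(unbalanced.index),
    }


# Toy data, not taken from the dataset: D1 balances, D2 does not.
toy = pd.DataFrame(
    {
        "line_document_id": ["D1", "D1", "D2", "D2"],
        "line_debit_amount": [100.0, None, 50.0, None],
        "line_credit_amount": [None, 100.0, None, 40.0],
    }
)
report = balance_report(toy)
print(report)
```

On the full dataset, calling `balance_report(df)` on the frame from the Quick Start should give a `relative_gap` in line with the stated 0.25% bound, and the unbalanced documents could then be cross-checked against `is_fraud` and `is_anomaly`.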
## Columns (40)
Header fields: `company_code`, `fiscal_year`, `fiscal_period`, `posting_date`, `document_date`, `document_type`, `currency`, `business_process`, `is_fraud`, `fraud_type`, `is_manual`, `is_anomaly`, `is_elimination`, `is_post_close`, `sod_violation`, `sox_relevant`, `source`, `user_persona`, `created_by`, `approved_by`, `control_status`, `ledger`
Line fields: `line_gl_account`, `line_debit_amount`, `line_credit_amount`, `line_description`, `line_cost_center`, `line_profit_center`, `line_line_number`, `line_document_id`, `line_tax_code`, `line_quantity`, `line_uom`, `line_assignment`, `line_reference`
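Because header fields are repeated on every line, document-level statistics need one row per document. The sketch below separates the two column groups by the `line_` prefix and collapses lines into documents. It assumes `line_document_id` is the document key and that header values are identical on all lines of a document. `split_columns`, `to_documents`, and the toy frame are hypothetical names used for illustration.

```python
import pandas as pd


def split_columns(df: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Return (header_columns, line_columns) based on the `line_` prefix."""
    line_cols = [c for c in df.columns if c.startswith("line_")]
    header_cols = [c for c in df.columns if not c.startswith("line_")]
    return header_cols, line_cols


def to_documents(df: pd.DataFrame, key: str = "line_document_id") -> pd.DataFrame:
    """One row per document: first header values plus a line count.

    Assumes header fields are identical on every line of a document.
    """
    header_cols, _ = split_columns(df)
    grouped = df.groupby(key)
    docs = grouped[header_cols].first()
    docs["n_lines"] = grouped.size()
    return docs


# Toy data, not taken from the dataset.
toy = pd.DataFrame(
    {
        "company_code": ["C001", "C001", "C002"],
        "is_fraud": [False, False, True],
        "line_document_id": ["D1", "D1", "D2"],
        "line_debit_amount": [100.0, 0.0, 50.0],
    }
)
header_cols, line_cols = split_columns(toy)
docs = to_documents(toy)
print(docs)
print("Document-level fraud rate:", docs["is_fraud"].mean())
```

A line-level fraud rate and a document-level fraud rate can differ when fraudulent documents have more or fewer lines than average, so it is worth stating which one a reported figure refers to.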
## Quick Start
```python
from datasets import load_dataset

# Load the train split and convert it to a pandas DataFrame
ds = load_dataset("VynFi/vynfi-journal-entries-1m", split="train")
df = ds.to_pandas()

# Fraud entries
fraud = df[df["is_fraud"] == True]
print(f"Fraud: {len(fraud)} / {len(df)} ({len(fraud)/len(df)*100:.1f}%)")

# Benford first-digit test: leading significant digit of each positive debit amount
amounts = df["line_debit_amount"][df["line_debit_amount"] > 0]
digits = amounts.apply(lambda x: int(str(abs(x)).lstrip("0.")[0]))
print(digits.value_counts(normalize=True).sort_index())
```
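The first-digit frequencies printed above are only informative next to the Benford expectation, P(d) = log10(1 + 1/d). The following standard-library sketch computes both and summarises the gap as a mean absolute deviation. The function names and the toy amounts are hypothetical; the card does not specify which Benford statistic, if any, the generator targets.

```python
import math
from collections import Counter


def first_digit(x: float) -> int:
    """Leading significant digit of a positive, finite number."""
    # 17 significant digits are enough to avoid rounding up across a power of ten.
    return int(f"{x:.16e}"[0])


def benford_table(amounts):
    """Return ([(digit, observed, expected), ...], mean_absolute_deviation)."""
    digits = [
        first_digit(a)
        for a in amounts
        if a is not None and a > 0 and math.isfinite(a)  # NaN fails `a > 0`
    ]
    n = len(digits)
    if n == 0:
        raise ValueError("no positive finite amounts")
    counts = Counter(digits)
    rows = []
    for d in range(1, 10):
        observed = counts.get(d, 0) / n
        expected = math.log10(1 + 1 / d)  # Benford's law for the first digit
        rows.append((d, observed, expected))
    mad = sum(abs(o - e) for _, o, e in rows) / 9
    return rows, mad


# Toy amounts, not taken from the dataset; zero and negatives are skipped.
rows, mad = benford_table([1.5, 12.0, 0.019, 230.0, 3.3, 0.0, -4.0])
for d, observed, expected in rows:
    print(f"{d}: observed={observed:.3f} expected={expected:.3f}")
print(f"MAD = {mad:.4f}")
```

With the DataFrame from the Quick Start, `benford_table(df["line_debit_amount"])` applies the same comparison to the debit amounts. Running it separately on fraud and non-fraud rows is one way to see whether the injected fraud shifts the digit distribution.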
## Limitations
- Fraud labels are injected, not discovered. The ground truth is known by construction.
- Amounts use native floats. Precision beyond 2 decimal places is not guaranteed.
- The dataset is generated, not sampled from real ledgers. Statistical properties approximate but do not replicate any specific company.
- 24 columns were dropped for having >50% null values (approval workflows, batch IDs, SOD conflict details).
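Because amounts are stored as floats, exact comparisons and large sums can pick up binary rounding error. One common workaround is to convert amounts to integer minor units before summing. The sketch below assumes a currency with two decimal places, which the card does not state for every value of `currency`; `to_cents` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_cents(amount: float) -> int:
    """Round a float amount to integer minor units (assumes 2 decimal places).

    Null/NaN amounts are not handled here and should be filtered out first.
    """
    # str() uses the float's shortest round-trip repr, so 0.1 becomes Decimal("0.1").
    quantized = Decimal(str(amount)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return int(quantized * 100)


amounts = [0.1, 0.2, 0.3]
float_total = sum(amounts)                       # subject to binary rounding error
cents_total = sum(to_cents(a) for a in amounts)  # exact integer arithmetic
print(float_total == 0.6, cents_total)
```

Integer totals make per-document balance checks exact, at the cost of committing to a rounding rule (half-up here, chosen only for illustration).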
## Citation
```bibtex
@dataset{ivertowski_vynfi_je_2026,
title = {VynFi Journal Entries: 2.1M Line Items with Fraud Labels},
author = {Michael Ivertowski},
year = {2026},
url = {https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m},
note = {Generated with VynFi (https://vynfi.com)}
}
```
License: Apache 2.0. Entirely synthetic.
Provider:
VynFi



