
VynFi/vynfi-journal-entries-1m

Source: Hugging Face. Updated 2026-04-14; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/VynFi/vynfi-journal-entries-1m
Description:
---
license: apache-2.0
task_categories:
- tabular-classification
tags:
- synthetic
- financial-data
- vynfi
- journal-entries
- audit-analytics
- benford
- fraud-detection
size_categories:
- 1M<n<10M
---

# VynFi Journal Entries: 2.1M Line Items with Fraud Labels

2,106,112 journal entry line items from 200,384 documents. Manufacturing sector, 10 companies, 12 monthly periods.

Each row is a single debit or credit posting with the parent document's header fields merged in. 40 columns after dropping sparse fields.

Key properties:

- 6.92% fraud rate (revenue fraud, vendor kickback, payroll ghost, management override)
- 21% manual entries (ISA 240 relevant)
- Double-entry: total debits and credits are within 0.25% of each other across the full dataset
- GL account codes follow a standard manufacturing chart of accounts
- Native float amounts (no string conversion needed)

This is not a curated research dataset. It is a raw generation output with known limitations: some header fields are sparse (approval workflows, SOD conflicts), OCPM case IDs are mostly null, and the debit/credit imbalance reflects anomaly-injected fraud entries that are intentionally unbalanced.
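The balance property described above can be checked directly. What follows is a minimal sketch, assuming the rows are already in a pandas DataFrame and that `line_document_id` identifies the parent document (the column is listed in the dataset, but that reading of it is an assumption). The helper names `relative_imbalance` and `unbalanced_documents`, the 0.005 tolerance, and the toy frame are illustrative and not part of VynFi.

```python
import pandas as pd


def relative_imbalance(df: pd.DataFrame) -> float:
    """Return |total debits - total credits| / max(total debits, total credits)."""
    debits = df["line_debit_amount"].sum()    # NaN values are skipped by default
    credits = df["line_credit_amount"].sum()
    denom = max(debits, credits)
    return abs(debits - credits) / denom if denom else 0.0


def unbalanced_documents(df: pd.DataFrame, tol: float = 0.005) -> pd.Series:
    """Return the absolute debit/credit gap for documents where it exceeds `tol`."""
    per_doc = df.groupby("line_document_id")[
        ["line_debit_amount", "line_credit_amount"]
    ].sum()
    gap = (per_doc["line_debit_amount"] - per_doc["line_credit_amount"]).abs()
    return gap[gap > tol]


# Toy data (hypothetical): document A balances, document B is off by 5.00.
toy = pd.DataFrame(
    {
        "line_document_id": ["A", "A", "B", "B"],
        "line_debit_amount": [100.0, 0.0, 50.0, 0.0],
        "line_credit_amount": [0.0, 100.0, 0.0, 45.0],
    }
)
print(relative_imbalance(toy))
print(unbalanced_documents(toy))
```

On the real data, comparing the index of `unbalanced_documents(df)` against the rows flagged `is_fraud` or `is_anomaly` would show how much of the imbalance comes from injected entries.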
## Columns (40)

Header fields: `company_code`, `fiscal_year`, `fiscal_period`, `posting_date`, `document_date`, `document_type`, `currency`, `business_process`, `is_fraud`, `fraud_type`, `is_manual`, `is_anomaly`, `is_elimination`, `is_post_close`, `sod_violation`, `sox_relevant`, `source`, `user_persona`, `created_by`, `approved_by`, `control_status`, `ledger`

Line fields: `line_gl_account`, `line_debit_amount`, `line_credit_amount`, `line_description`, `line_cost_center`, `line_profit_center`, `line_line_number`, `line_document_id`, `line_tax_code`, `line_quantity`, `line_uom`, `line_assignment`, `line_reference`

## Quick Start

```python
from datasets import load_dataset

ds = load_dataset("VynFi/vynfi-journal-entries-1m", split="train")
df = ds.to_pandas()

# Fraud entries
fraud = df[df["is_fraud"] == True]
print(f"Fraud: {len(fraud)} / {len(df)} ({len(fraud)/len(df)*100:.1f}%)")

# Benford first-digit test
amounts = df["line_debit_amount"][df["line_debit_amount"] > 0]
digits = amounts.apply(lambda x: int(str(abs(x)).lstrip("0.")[0]))
print(digits.value_counts(normalize=True).sort_index())
```

## Limitations

- Fraud labels are injected, not discovered. The ground truth is known by construction.
- Amounts use native floats. Precision beyond 2 decimal places is not guaranteed.
- The dataset is generated, not sampled from real ledgers. Statistical properties approximate but do not replicate any specific company.
- 24 columns were dropped for having >50% null values (approval workflows, batch IDs, SOD conflict details).

## Citation

```bibtex
@dataset{ivertowski_vynfi_je_2026,
  title = {VynFi Journal Entries: 2.1M Line Items with Fraud Labels},
  author = {Michael Ivertowski},
  year = {2026},
  url = {https://huggingface.co/datasets/VynFi/vynfi-journal-entries-1m},
  note = {Generated with VynFi (https://vynfi.com)}
}
```

License: Apache 2.0. Entirely synthetic.
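The Quick Start prints observed first-digit frequencies but does not compare them with anything. One way to complete the test is to measure the distance to the Benford expectation, P(d) = log10(1 + 1/d). The sketch below uses the mean absolute deviation (MAD) between observed and expected proportions, a common summary for this purpose. The function names are hypothetical, the digit extraction reuses the string trick from the Quick Start, and no pass/fail threshold is implied.

```python
import math
from collections import Counter


def first_digit(x: float) -> int:
    """First significant digit, via the same string trick as the Quick Start."""
    return int(str(abs(x)).lstrip("0.")[0])


def benford_expected() -> dict:
    """Expected first-digit proportions under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def benford_mad(amounts) -> float:
    """Mean absolute deviation between observed and Benford proportions.

    Non-positive, missing (None), and NaN values are ignored.
    """
    digits = [first_digit(x) for x in amounts if x is not None and x > 0]
    if not digits:
        raise ValueError("no positive amounts to test")
    counts = Counter(digits)
    n = len(digits)
    expected = benford_expected()
    return sum(abs(counts.get(d, 0) / n - expected[d]) for d in range(1, 10)) / 9


# Toy input (hypothetical): every amount starts with 1, so the deviation is large.
print(benford_mad([1.0, 10.0, 100.0]))
```

Because `benford_mad` accepts any iterable, it can be called as `benford_mad(df["line_debit_amount"])`. Running it separately on fraud and non-fraud rows is one way to see whether the injected entries shift the digit distribution.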
Provider:
VynFi