
monadgeek/fenra

Source: Hugging Face | Updated: 2026-03-16 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/monadgeek/fenra
Description:
---
license: mit
task_categories:
- text-generation
- text-classification
- question-answering
language:
- en
tags:
- procurement
- fraud-detection
- kenya
- government-contracts
- phi3
- instruction-tuning
size_categories:
- 100K<n<1M
---

# Fenra Procurement Fraud Detection Dataset

High-quality training data for fine-tuning LLMs on Kenyan government procurement fraud detection and contract analysis.

## Dataset Summary

| Split | File | Records | Size |
|-------|------|---------|------|
| Contracts Train | `contracts/train.jsonl` | 14,400 | ~9 MB |
| Contracts Validation | `contracts/validation.jsonl` | 800 | ~500 KB |
| Contracts Test | `contracts/test.jsonl` | 800 | ~500 KB |
| Suppliers | `suppliers/suppliers_training.jsonl` | 60,906 | ~25 MB |
| Fraud Train | `fraud/train.jsonl` | 2,975 | ~2.5 MB |
| Fraud Val | `fraud/val.jsonl` | 372 | ~300 KB |
| Fraud Test | `fraud/test.jsonl` | 372 | ~300 KB |
| Synthetic | `synthetic/synthetic_train.jsonl` | 438,967 | ~248 MB |
| **Total** | | **~520K** | **~285 MB** |

## Schema

All JSONL files use a consistent schema with these fields:

- `instruction`: Task description for the model
- `input`: Context/data for the task
- `output`: Expected model response
- `type`: Task category (e.g., `fraud_detection`, `contract_analysis`, `supplier_lookup`)
- `source`: Data origin (e.g., `tenders.go.ke`, `synthetic`, `expanded`)

## Splits

Hugging Face `datasets` library will auto-detect splits based on file names:

- `train.jsonl` → `train` split
- `validation.jsonl` or `val.jsonl` → `validation` split
- `test.jsonl` → `test` split

## Usage

```python
from datasets import load_dataset

# Load full dataset (all splits)
dataset = load_dataset("your-username/fenra-procurement-fraud")

# Load specific split
train = dataset["train"]

# Or load specific directory
contracts = load_dataset("your-username/fenra-procurement-fraud", data_dir="contracts")
```

## Data Sources

- **tenders.go.ke**: Official Kenyan procurement portal (OCDS API)
- **PPDA 2005**: Public Procurement and Disposal Act
- **EACC Records**: Ethics and Anti-Corruption Commission cases
- **Synthetic Generation**: AI-generated training scenarios

## License

MIT License - Free for research and model training
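
## Example: Checking Records Against the Schema

The Schema section lists five fields that every JSONL record should carry. The sketch below shows one way to check that before training, using only the Python standard library. The helper names `parse_record` and `count_by_type` are hypothetical, and the sample rows are invented for illustration; they are not taken from the dataset.

```python
import json
from collections import Counter

# The five field names listed in the Schema section.
REQUIRED_FIELDS = ("instruction", "input", "output", "type", "source")


def parse_record(line):
    """Parse one JSONL line and check that it carries every schema field."""
    record = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    return record


def count_by_type(lines):
    """Count records per `type` value, skipping blank lines."""
    return Counter(parse_record(line)["type"] for line in lines if line.strip())


# Hypothetical sample rows (invented for illustration, not real dataset content).
sample_lines = [
    json.dumps({
        "instruction": "Flag any fraud indicators in this tender.",
        "input": "Single bidder; award value far above the estimate.",
        "output": "Possible bid rigging: no competition and an inflated price.",
        "type": "fraud_detection",
        "source": "synthetic",
    }),
    json.dumps({
        "instruction": "Summarise the key terms of this contract.",
        "input": "Supply of office furniture, 12-month delivery period.",
        "output": "Goods contract with a one-year delivery window.",
        "type": "contract_analysis",
        "source": "synthetic",
    }),
    json.dumps({
        "instruction": "Flag any fraud indicators in this tender.",
        "input": "Contract split into several awards just below the threshold.",
        "output": "Possible contract splitting to avoid open tendering.",
        "type": "fraud_detection",
        "source": "synthetic",
    }),
]

print(count_by_type(sample_lines))
```

To run the same check on a downloaded file, pass an open file handle instead of the sample list, for example `count_by_type(open("fraud/train.jsonl", encoding="utf-8"))`, assuming the file has been saved locally under the path shown in the Dataset Summary table.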
Provider:
monadgeek