monadgeek/fenra
Source: Hugging Face · Updated 2026-03-16 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/monadgeek/fenra
Description:
---
license: mit
task_categories:
- text-generation
- text-classification
- question-answering
language:
- en
tags:
- procurement
- fraud-detection
- kenya
- government-contracts
- phi3
- instruction-tuning
size_categories:
- 100K<n<1M
---
# Fenra Procurement Fraud Detection Dataset
Instruction-formatted training data for fine-tuning LLMs on Kenyan government procurement fraud detection and contract analysis.
## Dataset Summary
| Split | File | Records | Size |
|-------|------|---------|------|
| Contracts Train | `contracts/train.jsonl` | 14,400 | ~9 MB |
| Contracts Validation | `contracts/validation.jsonl` | 800 | ~500 KB |
| Contracts Test | `contracts/test.jsonl` | 800 | ~500 KB |
| Suppliers | `suppliers/suppliers_training.jsonl` | 60,906 | ~25 MB |
| Fraud Train | `fraud/train.jsonl` | 2,975 | ~2.5 MB |
| Fraud Validation | `fraud/val.jsonl` | 372 | ~300 KB |
| Fraud Test | `fraud/test.jsonl` | 372 | ~300 KB |
| Synthetic | `synthetic/synthetic_train.jsonl` | 438,967 | ~248 MB |
| **Total** | | **~520K** | **~285 MB** |
## Schema
All JSONL files use a consistent schema with these fields:
- `instruction`: Task description for the model
- `input`: Context/data for the task
- `output`: Expected model response
- `type`: Task category (e.g., `fraud_detection`, `contract_analysis`, `supplier_lookup`)
- `source`: Data origin (e.g., `tenders.go.ke`, `synthetic`, `expanded`)
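A minimal sketch of checking one JSONL line against the five fields listed above. The helper name `validate_record` and the sample record are hypothetical, invented for illustration; the sample is not taken from the dataset, and the check only tests that each field is present, since the card does not state the field types.

```python
import json

# Field names taken from the schema list above.
REQUIRED_FIELDS = ("instruction", "input", "output", "type", "source")


def validate_record(line: str) -> dict:
    """Parse one JSONL line and check that every schema field is present."""
    record = json.loads(line)
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return record


# Hypothetical record, invented for illustration only (not real dataset content).
sample_line = json.dumps({
    "instruction": "Assess whether this tender shows signs of fraud.",
    "input": "Tender awarded to a single bidder one day after publication.",
    "output": "Possible red flag: unusually short bidding window.",
    "type": "fraud_detection",
    "source": "synthetic",
})

record = validate_record(sample_line)
print(record["type"])
```

Running the same check over each line of a downloaded file is a cheap way to confirm the schema before fine-tuning.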
## Splits
The Hugging Face `datasets` library auto-detects splits from file names:
- `train.jsonl` → `train` split
- `validation.jsonl` or `val.jsonl` → `validation` split
- `test.jsonl` → `test` split
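The three rules above can be sketched as a lookup on the file stem. This is a simplified illustration, not the library's implementation: the function `split_for` and the exact-stem rule are assumptions, and the library's real pattern matching is broader, so it may classify other file names differently.

```python
from pathlib import PurePosixPath
from typing import Optional

# Exact-stem mapping mirroring the three rules listed above (a simplification).
SPLIT_BY_STEM = {
    "train": "train",
    "validation": "validation",
    "val": "validation",
    "test": "test",
}


def split_for(path: str) -> Optional[str]:
    """Return the split name implied by a file name, or None if no rule applies."""
    stem = PurePosixPath(path).stem  # "fraud/val.jsonl" -> "val"
    return SPLIT_BY_STEM.get(stem)


for path in ("contracts/train.jsonl", "fraud/val.jsonl", "fraud/test.jsonl"):
    print(path, "->", split_for(path))
```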
## Usage
```python
from datasets import load_dataset

# Load the full dataset (all splits); the repo ID matches the download link above
dataset = load_dataset("monadgeek/fenra")

# Select a specific split
train = dataset["train"]

# Or load only the files under one directory
contracts = load_dataset("monadgeek/fenra", data_dir="contracts")
```
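When relying on file-name detection is undesirable, the split files can also be named explicitly through the `data_files` argument of `load_dataset`. The sketch below assumes the repo ID `monadgeek/fenra` from the page header and the fraud file paths from the summary table; the helper name `load_fraud_splits` is hypothetical, and the function needs network access plus the `datasets` package when called.

```python
# Explicit split-to-file mapping, using the fraud paths from the summary table.
FRAUD_FILES = {
    "train": "fraud/train.jsonl",
    "validation": "fraud/val.jsonl",
    "test": "fraud/test.jsonl",
}


def load_fraud_splits(repo_id: str = "monadgeek/fenra"):
    """Load only the fraud files, with each split pinned to a named file."""
    # Imported lazily so the mapping above can be inspected without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, data_files=FRAUD_FILES)
```

Calling `load_fraud_splits()` should return a `DatasetDict` keyed by the three split names in `FRAUD_FILES`.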
## Data Sources
- **tenders.go.ke**: Official Kenyan procurement portal (OCDS API)
- **PPDA 2005**: Public Procurement and Disposal Act
- **EACC Records**: Ethics and Anti-Corruption Commission cases
- **Synthetic Generation**: AI-generated training scenarios
## License
MIT License - Free for research and model training
Provider:
monadgeek



