adwaith06/indic-synthetic-profiles
Hugging Face · Updated 2026-03-29 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/adwaith06/indic-synthetic-profiles
Resource description:
---
language:
- hi
- ml
- ta
- te
- bn
- kn
- gu
- mr
- en
license: mit
size_categories:
- 1K<n<10K
task_categories:
- text-generation
- tabular-classification
- tabular-regression
tags:
- synthetic
- india
- indian-names
- aadhaar
- pan
- faker
- indic
- kyc
- fraud-detection
- test-data
- multilingual
- hindi
- malayalam
- tamil
- bengali
- telugu
- kannada
- gujarati
- marathi
pretty_name: "Indian Synthetic Profiles (indic-faker)"
source_datasets:
- original
---
# 🇮🇳 Indian Synthetic Identity Dataset
<div align="center">
**10,000 realistic Indian synthetic identities across 8 languages — generated by [indic-faker](https://github.com/adwaith-0/indic-faker)**
[PyPI](https://pypi.org/project/indic-faker/)
[GitHub](https://github.com/adwaith-0/indic-faker)
[License](https://github.com/adwaith-0/indic-faker/blob/main/LICENSE)
</div>
## Dataset Description
This dataset contains **10,000 rows** of realistic, synthetic Indian identity data generated using the [indic-faker](https://github.com/adwaith-0/indic-faker) Python library. Generated identifiers are structurally valid: Aadhaar numbers pass Verhoeff checksum verification, GSTINs carry correct state codes, and names are drawn from culturally appropriate pools across 8 Indian languages. Note that the column list for this dataset includes `aadhaar` and `pan` but no GSTIN field.
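The Verhoeff check mentioned above can be reproduced independently of indic-faker. The following is a minimal sketch using the standard published Verhoeff tables; the function names are illustrative and are not part of the indic-faker API.

```python
# Standard Verhoeff tables: dihedral-group multiplication (D),
# position-dependent permutation (P), and group inverse (INV).
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_check_digit(payload: str) -> int:
    """Return the check digit to append to a string of ASCII digits."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = _D[c][_P[(i + 1) % 8][int(ch)]]
    return _INV[c]


def verhoeff_is_valid(number: str) -> bool:
    """True if the digits (spaces ignored) end in a valid Verhoeff check digit."""
    digits = number.replace(" ", "")
    if not (digits.isascii() and digits.isdigit()):
        return False
    c = 0
    for i, ch in enumerate(reversed(digits)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0


# Arbitrary 11-digit payload chosen for illustration, not a real Aadhaar number.
payload = "23456789012"
candidate = payload + str(verhoeff_check_digit(payload))
print(candidate, verhoeff_is_valid(candidate))
```

Applied to the dataset, `verhoeff_is_valid` could be mapped over the `aadhaar` column; it ignores the spaces used in the `XXXX XXXX XXXX` layout.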
### Why This Dataset?
Indian AI development suffers from a critical gap: most synthetic data libraries generate Western-centric data (`"John Smith, 123 Main St"`). This dataset provides **India-first synthetic data** with:
- Names in **Hindi, Malayalam, Tamil, Telugu, Bengali, Kannada, Gujarati, and Marathi** (both native script and Latin transliteration)
- **Algorithm-validated** Indian ID numbers (Aadhaar, PAN, GSTIN)
- **State-aware** addresses with real pincodes
- Indian financial data (UPI IDs, IFSC codes, INR amounts in lakhs/crores)
- Realistic employment data (Indian companies, salary in LPA, IIT/NIT colleges)
## Dataset Structure
### Columns (23 fields)
| Column | Type | Description | Example |
|--------|------|-------------|---------|
| `name` | string | Full name (Latin script) | Rajesh Krishnan |
| `name_native` | string | Full name (native Indic script) | രാജേഷ് കൃഷ്ണൻ |
| `gender` | string | male / female | male |
| `dob` | string | Date of birth (DD/MM/YYYY) | 15/08/1990 |
| `age` | int | Age in years | 35 |
| `language` | string | ISO 639-1 language code | ml |
| `aadhaar` | string | Aadhaar number (Verhoeff ✓) | 3847 2918 4721 |
| `pan` | string | PAN number | ABCPK1234F |
| `phone` | string | Mobile number (+91) | +91 94471 82931 |
| `email` | string | Email address | rajesh.k@gmail.com |
| `address` | string | Full address with pincode | TC 14/2341, Pettah, TVM - 695024 |
| `city` | string | City name | Thiruvananthapuram |
| `state` | string | Indian state | Kerala |
| `pincode` | string | Valid 6-digit pincode | 695024 |
| `bank_account_ifsc` | string | IFSC code | SBIN0001234 |
| `bank_account_account` | string | Account number | 38291847291 |
| `bank_account_bank` | string | Bank name | SBI |
| `upi_id` | string | UPI ID | rajesh.k@okicici |
| `employer` | string | Indian company/employer | Infosys |
| `job_title` | string | Job title | Senior Software Engineer |
| `salary` | string | Salary in LPA | ₹12.5 LPA |
| `college` | string | Indian college/university | IIT Bombay |
| `degree` | string | Academic degree | B.Tech |
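Because `dob` and `salary` are stored as strings, numeric work usually needs a parsing step first. The sketch below uses only the standard library and assumes the formats shown in the Example column (`DD/MM/YYYY` and `₹<number> LPA`); the helper names are illustrative.

```python
import re
from datetime import date, datetime


def parse_dob(dob: str) -> date:
    """Parse the DD/MM/YYYY string used in the `dob` column."""
    return datetime.strptime(dob, "%d/%m/%Y").date()


def age_on(dob: date, as_of: date) -> int:
    """Completed years between dob and the reference date as_of."""
    had_birthday = (as_of.month, as_of.day) >= (dob.month, dob.day)
    return as_of.year - dob.year - (0 if had_birthday else 1)


def salary_lpa(salary: str) -> float:
    """Extract the numeric part of a string such as '₹12.5 LPA'."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*LPA", salary)
    if match is None:
        raise ValueError(f"unrecognised salary format: {salary!r}")
    return float(match.group(1))


dob = parse_dob("15/08/1990")
# Reference date chosen for illustration only.
print(age_on(dob, date(2026, 3, 29)))
# LPA means lakhs per annum; 1 lakh = 100,000 rupees.
print(salary_lpa("₹12.5 LPA") * 100_000)
```

Rows whose `salary` does not match the assumed pattern raise `ValueError`, which makes format drift visible instead of silently producing missing values.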
### Languages Represented
| Code | Language | Script | Approx. % of Dataset |
|:----:|:---------|:-------|:---------------------|
| `hi` | Hindi | देवनागरी | ~12.5% |
| `ml` | Malayalam | മലയാളം | ~12.5% |
| `ta` | Tamil | தமிழ் | ~12.5% |
| `te` | Telugu | తెలుగు | ~12.5% |
| `bn` | Bengali | বাংলা | ~12.5% |
| `kn` | Kannada | ಕನ್ನಡ | ~12.5% |
| `gu` | Gujarati | ગુજરાતી | ~12.5% |
| `mr` | Marathi | मराठी | ~12.5% |
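One way to sanity-check that `name_native` agrees with `language` is to compare each character against the Unicode block for the script in the table above. This is a rough sketch: the block ranges are the standard Unicode ones, but the function name and the decision to allow only whitespace and zero-width joiners as extra characters are assumptions, so names containing punctuation such as initials with periods would be rejected.

```python
# Unicode block ranges for the scripts listed in the table above.
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari
    "mr": (0x0900, 0x097F),  # Devanagari
    "bn": (0x0980, 0x09FF),  # Bengali
    "gu": (0x0A80, 0x0AFF),  # Gujarati
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "te": (0x0C00, 0x0C7F),  # Telugu
    "kn": (0x0C80, 0x0CFF),  # Kannada
    "ml": (0x0D00, 0x0D7F),  # Malayalam
}

# Zero-width non-joiner and joiner, which can appear inside Indic text.
_ALLOWED_EXTRA = {"\u200c", "\u200d"}


def matches_script(name_native: str, language: str) -> bool:
    """True if every visible character falls in the block for `language`."""
    low, high = SCRIPT_RANGES[language]
    chars = [
        c for c in name_native
        if not c.isspace() and c not in _ALLOWED_EXTRA
    ]
    return bool(chars) and all(low <= ord(c) <= high for c in chars)


print(matches_script("രാജേഷ് കൃഷ്ണൻ", "ml"))
print(matches_script("Rajesh Krishnan", "ml"))
```

Hindi and Marathi share Devanagari, so this check cannot tell those two languages apart.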
## Use Cases
### 🔍 Fraud Detection Model Training
Train ML models to detect fraudulent KYC submissions, synthetic identity fraud, and anomalous transaction patterns using realistic Indian financial data.
### 🤖 LLM Fine-Tuning
Fine-tune language models on Indian names, addresses, and multilingual text. Build chatbots and NLP systems that understand Indian identity formats.
### ✅ KYC System Testing
Test Know Your Customer (KYC) verification systems with structurally valid Aadhaar, PAN, and GSTIN numbers without using real PII.
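A KYC test harness typically starts with format checks before any deeper verification. The sketch below applies commonly documented layouts (PAN as five letters, four digits, one letter; IFSC as four letters, a zero, and six alphanumerics; a six-digit pincode; a +91 mobile number starting with 6 to 9) to a record shaped like the example row in the column table. The patterns and the `check_formats` helper are assumptions for illustration and check layout only, not whether a number was ever issued.

```python
import re

PAN_RE = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")
IFSC_RE = re.compile(r"[A-Z]{4}0[A-Z0-9]{6}")
PINCODE_RE = re.compile(r"[1-9][0-9]{5}")
PHONE_RE = re.compile(r"\+91[6-9][0-9]{9}")


def check_formats(record: dict) -> dict:
    """Return {field: passed} for each layout check on one record."""
    phone = record["phone"].replace(" ", "")
    return {
        "pan": PAN_RE.fullmatch(record["pan"]) is not None,
        "bank_account_ifsc": IFSC_RE.fullmatch(record["bank_account_ifsc"]) is not None,
        "pincode": PINCODE_RE.fullmatch(record["pincode"]) is not None,
        "phone": PHONE_RE.fullmatch(phone) is not None,
    }


# Values copied from the Example column of the dataset card.
sample = {
    "pan": "ABCPK1234F",
    "bank_account_ifsc": "SBIN0001234",
    "pincode": "695024",
    "phone": "+91 94471 82931",
}
print(check_formats(sample))
```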
### 📊 Data Pipeline Testing
Stress-test ETL pipelines, data validation rules, and database schemas with realistic Indian data at scale.
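For pipeline tests, a simple first gate is to compare incoming column names against the 23 fields documented in this card. The following sketch hard-codes that list; `schema_diff` is a hypothetical helper name, not something provided by indic-faker or `datasets`.

```python
# The 23 columns documented in the "Columns" table of this card.
EXPECTED_COLUMNS = [
    "name", "name_native", "gender", "dob", "age", "language",
    "aadhaar", "pan", "phone", "email", "address", "city", "state",
    "pincode", "bank_account_ifsc", "bank_account_account",
    "bank_account_bank", "upi_id", "employer", "job_title",
    "salary", "college", "degree",
]


def schema_diff(columns):
    """Return (missing, unexpected) column names relative to the card."""
    seen = set(columns)
    expected = set(EXPECTED_COLUMNS)
    missing = [c for c in EXPECTED_COLUMNS if c not in seen]
    unexpected = [c for c in columns if c not in expected]
    return missing, unexpected


missing, unexpected = schema_diff(["name", "aadhaar", "gstin"])
print(len(missing), unexpected)
```

With a loaded DataFrame, the same function could be called as `schema_diff(list(df.columns))`.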
### 🎓 Education & Research
Use for academic research on Indian demographic patterns, NLP tasks involving Indic scripts, and data science coursework.
### 📱 Application Prototyping
Populate Indian fintech, e-commerce, and HR application prototypes with realistic demo data.
## How to Load
```python
from datasets import load_dataset
dataset = load_dataset("adwaith06/indic-synthetic-profiles")
df = dataset["train"].to_pandas()
print(df.head())
print(f"Rows: {len(df)}, Columns: {len(df.columns)}")
```
## How to Generate More
Want 100K rows? 1M rows? Custom fields? Generate your own with [indic-faker](https://github.com/adwaith-0/indic-faker):
```bash
pip install "indic-faker[ml]"
```
```python
from indic_faker import IndicFaker
fake = IndicFaker()
# Generate 100,000 rows as a pandas DataFrame
df = fake.to_dataframe(100_000)
df.to_csv("my_dataset.csv", index=False)
# Or generate with specific fields only
df = fake.to_dataframe(50_000, fields=["name", "name_native", "aadhaar", "phone", "city"])
```
## Ethical Considerations
⚠️ **This is 100% synthetic data.** No real individuals are represented. However:
- Aadhaar and PAN numbers are **structurally valid** (Aadhaar numbers pass Verhoeff checksum verification) but are **randomly generated** and are not derived from any real person's records; any match with an issued number would be coincidental
- Names are drawn from common Indian name pools and do not represent real individuals
- Addresses use real city/state/pincode combinations but house numbers are fictional
- **Do not use this data to impersonate real individuals or commit fraud**
## Citation
```bibtex
@misc{indicfaker2026,
title={indic-faker: Generate Realistic Indian Synthetic Data},
author={Adwai},
year={2026},
publisher={HuggingFace},
url={https://github.com/adwaith-0/indic-faker}
}
```
## License
MIT License — free for everyone, forever.
Provider:
adwaith06



