
adwaith06/indic-synthetic-profiles

Source: Hugging Face · Updated 2026-03-29 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/adwaith06/indic-synthetic-profiles
Resource description:
---
language:
- hi
- ml
- ta
- te
- bn
- kn
- gu
- mr
- en
license: mit
size_categories:
- 1K<n<10K
task_categories:
- text-generation
- tabular-classification
- tabular-regression
tags:
- synthetic
- india
- indian-names
- aadhaar
- pan
- faker
- indic
- kyc
- fraud-detection
- test-data
- multilingual
- hindi
- malayalam
- tamil
- bengali
- telugu
- kannada
- gujarati
- marathi
pretty_name: "Indian Synthetic Profiles (indic-faker)"
source_datasets:
- original
---

# 🇮🇳 Indian Synthetic Identity Dataset

<div align="center">

**10,000 realistic Indian synthetic identities across 8 languages, generated by [indic-faker](https://github.com/adwaith-0/indic-faker)**

[![PyPI](https://img.shields.io/pypi/v/indic-faker?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/indic-faker/)
[![GitHub](https://img.shields.io/github/stars/adwaith-0/indic-faker?style=flat-square&logo=github)](https://github.com/adwaith-0/indic-faker)
[![License](https://img.shields.io/badge/license-MIT-green?style=flat-square)](https://github.com/adwaith-0/indic-faker/blob/main/LICENSE)

</div>

## Dataset Description

This dataset contains **10,000 rows** of realistic, synthetic Indian identity data generated using the [indic-faker](https://github.com/adwaith-0/indic-faker) Python library. Every record is algorithmically valid: Aadhaar numbers pass Verhoeff checksum verification, GSTINs have correct state codes, and names are culturally authentic across 8 Indian languages.

### Why This Dataset?

Indian AI development suffers from a critical gap: most synthetic data libraries generate Western-centric data (`"John Smith, 123 Main St"`).
This dataset provides **India-first synthetic data** with:

- Names in **Hindi, Malayalam, Tamil, Telugu, Bengali, Kannada, Gujarati, and Marathi** (both native script and Latin transliteration)
- **Algorithm-validated** Indian ID numbers (Aadhaar, PAN, GSTIN)
- **State-aware** addresses with real pincodes
- Indian financial data (UPI IDs, IFSC codes, INR amounts in lakhs/crores)
- Realistic employment data (Indian companies, salary in LPA, IIT/NIT colleges)

## Dataset Structure

### Columns (23 fields)

| Column | Type | Description | Example |
|--------|------|-------------|---------|
| `name` | string | Full name (Latin script) | Rajesh Krishnan |
| `name_native` | string | Full name (native Indic script) | രാജേഷ് കൃഷ്ണൻ |
| `gender` | string | male / female | male |
| `dob` | string | Date of birth (DD/MM/YYYY) | 15/08/1990 |
| `age` | int | Age in years | 35 |
| `language` | string | ISO 639-1 language code | ml |
| `aadhaar` | string | Aadhaar number (Verhoeff ✓) | 3847 2918 4721 |
| `pan` | string | PAN number | ABCPK1234F |
| `phone` | string | Mobile number (+91) | +91 94471 82931 |
| `email` | string | Email address | rajesh.k@gmail.com |
| `address` | string | Full address with pincode | TC 14/2341, Pettah, TVM - 695024 |
| `city` | string | City name | Thiruvananthapuram |
| `state` | string | Indian state | Kerala |
| `pincode` | string | Valid 6-digit pincode | 695024 |
| `bank_account_ifsc` | string | IFSC code | SBIN0001234 |
| `bank_account_account` | string | Account number | 38291847291 |
| `bank_account_bank` | string | Bank name | SBI |
| `upi_id` | string | UPI ID | rajesh.k@okicici |
| `employer` | string | Indian company/employer | Infosys |
| `job_title` | string | Job title | Senior Software Engineer |
| `salary` | string | Salary in LPA | ₹12.5 LPA |
| `college` | string | Indian college/university | IIT Bombay |
| `degree` | string | Academic degree | B.Tech |

### Languages Represented

| Code | Language | Script | Approx. % of Dataset |
|:----:|:---------|:-------|:---------------------|
| `hi` | Hindi | देवनागरी | ~12.5% |
| `ml` | Malayalam | മലയാളം | ~12.5% |
| `ta` | Tamil | தமிழ் | ~12.5% |
| `te` | Telugu | తెలుగు | ~12.5% |
| `bn` | Bengali | বাংলা | ~12.5% |
| `kn` | Kannada | ಕನ್ನಡ | ~12.5% |
| `gu` | Gujarati | ગુજરાતી | ~12.5% |
| `mr` | Marathi | मराठी | ~12.5% |

## Use Cases

### 🔍 Fraud Detection Model Training

Train ML models to detect fraudulent KYC submissions, synthetic identity fraud, and anomalous transaction patterns using realistic Indian financial data.

### 🤖 LLM Fine-Tuning

Fine-tune language models on Indian names, addresses, and multilingual text. Build chatbots and NLP systems that understand Indian identity formats.

### ✅ KYC System Testing

Test Know Your Customer (KYC) verification systems with structurally valid Aadhaar, PAN, and GSTIN numbers without using real PII.

### 📊 Data Pipeline Testing

Stress-test ETL pipelines, data validation rules, and database schemas with realistic Indian data at scale.

### 🎓 Education & Research

Use for academic research on Indian demographic patterns, NLP tasks involving Indic scripts, and data science coursework.

### 📱 Application Prototyping

Populate Indian fintech, e-commerce, and HR application prototypes with realistic demo data.

## How to Load

```python
from datasets import load_dataset

dataset = load_dataset("adwaith06/indic-synthetic-profiles")
df = dataset["train"].to_pandas()

print(df.head())
print(f"Rows: {len(df)}, Columns: {len(df.columns)}")
```

## How to Generate More

Want 100K rows? 1M rows? Custom fields?
Generate your own with [indic-faker](https://github.com/adwaith-0/indic-faker):

```bash
# Quote the requirement so shells such as zsh do not expand the brackets
pip install "indic-faker[ml]"
```

```python
from indic_faker import IndicFaker

fake = IndicFaker()

# Generate 100,000 rows as a pandas DataFrame
df = fake.to_dataframe(100_000)
df.to_csv("my_dataset.csv", index=False)

# Or generate with specific fields only
df = fake.to_dataframe(50_000, fields=["name", "name_native", "aadhaar", "phone", "city"])
```

## Ethical Considerations

⚠️ **This is 100% synthetic data.** No real individuals are represented. However:

- Aadhaar and PAN numbers are **structurally valid** (pass checksum verification) but are **randomly generated** and do not belong to real people
- Names are drawn from common Indian name pools and do not represent real individuals
- Addresses use real city/state/pincode combinations but house numbers are fictional
- **Do not use this data to impersonate real individuals or commit fraud**

## Citation

```bibtex
@misc{indicfaker2026,
  title={indic-faker: Generate Realistic Indian Synthetic Data},
  author={Adwai},
  year={2026},
  publisher={HuggingFace},
  url={https://github.com/adwaith-0/indic-faker}
}
```

## License

MIT License: free for everyone, forever.
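
## Checking the Aadhaar Checksum (illustrative)

The dataset description states that Aadhaar numbers pass Verhoeff checksum verification. The sketch below shows one way a reader might check that independently. It is a plain implementation of the standard Verhoeff algorithm and is not taken from indic-faker; the function names `verhoeff_check_digit`, `verhoeff_is_valid`, and `is_valid_aadhaar_format` are hypothetical. It tests only the 12-digit shape and the checksum, not whether a number was ever issued.

```python
import re

# Standard Verhoeff tables: dihedral-group multiplication (_D),
# position-dependent permutation (_P), and group inverse (_INV).
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_check_digit(payload: str) -> int:
    """Return the Verhoeff check digit for a string of ASCII digits."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        # Positions are shifted by one because the check digit itself
        # will occupy position 0 of the final number.
        c = _D[c][_P[(i + 1) % 8][int(ch)]]
    return _INV[c]


def verhoeff_is_valid(number: str) -> bool:
    """Return True if a digit string (check digit last) passes Verhoeff."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0


def is_valid_aadhaar_format(value: str) -> bool:
    """Check the 12-digit shape and checksum; spaces are ignored."""
    digits = value.replace(" ", "")
    if re.fullmatch(r"[0-9]{12}", digits) is None:
        return False
    return verhoeff_is_valid(digits)


# Build a made-up 12-digit number from an arbitrary 11-digit payload.
payload = "23456789012"
candidate = payload + str(verhoeff_check_digit(payload))
print(candidate, is_valid_aadhaar_format(candidate))
```

Assuming the `aadhaar` column holds space-separated strings as in the column table, something like `df["aadhaar"].map(is_valid_aadhaar_format).mean()` would report the share of rows that pass this check.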
Provider:
adwaith06