Indic-PersonaHub
Source: DataCite Commons. Updated: 2026-05-07. Indexed: 2026-05-18.
Download link:
https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/D3SGJK
Resource description:
Dataset Description: Indic Persona Hub (Representative Sample Release)

The Indic Persona Hub is a large-scale, India-centric dataset of synthetic virtual personas created to support research in natural language processing (NLP), personalization, socially-aware machine learning, synthetic data generation, and dataset augmentation. This repository provides a representative sample release containing 20 million personas, while the complete dataset will be publicly released after paper acceptance.

The dataset contains 20,000,000 synthetic persona records spanning 90,633 domain buckets, covering a wide range of India-relevant themes such as education, business, healthcare, environment, technology, traditional knowledge systems, space, sustainability, social issues, and more. Each persona is a long-form human-readable description that includes background, interests, values, and likely discussion topics.

Personas were generated using two complementary synthetic generation approaches: Text-to-Persona, where large language models infer plausible personas from India-centric web and open-source texts, and Persona-to-Persona, where additional personas are derived from existing ones using relationship and role transformations to improve representation of low-visibility groups and relational identities.

The dataset is released in Apache Parquet (.parquet) format for efficient storage and high-speed large-scale processing. The provided schema is compact and includes three fields: a unique identifier (id), a consolidated persona description (persona), and a domain label (domain). This structure is designed to simplify downstream tasks such as indexing, search, clustering, and embedding-based retrieval.

All personas are fully synthetic and do not correspond to real individuals. The dataset is designed to reflect India’s cultural and regional diversity while minimizing personally identifiable information. Users should be aware that synthetic generation may inherit biases from source texts and models, and responsible use is strongly recommended, particularly for high-stakes applications.
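As an illustration of the three-field schema described above, the following is a minimal Python sketch of how the Parquet files might be streamed and summarized by domain. It is not part of the official release. It assumes the third-party pyarrow package is installed and that the column names are exactly id, persona, and domain as stated in the description; the column types are not documented here, so the code does not depend on them. The helper names and the file name used in the note below are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator

# Field names as stated in the dataset description; their types are not documented there.
EXPECTED_FIELDS = ("id", "persona", "domain")


def iter_persona_rows(path: str, batch_size: int = 50_000) -> Iterator[dict]:
    """Stream rows from one Parquet file in batches instead of loading it whole."""
    # Third-party dependency (assumed installed); imported lazily so that
    # count_by_domain below can be used without it.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(
        batch_size=batch_size, columns=list(EXPECTED_FIELDS)
    ):
        # Each batch becomes a list of {"id": ..., "persona": ..., "domain": ...} dicts.
        yield from batch.to_pylist()


def count_by_domain(rows: Iterable[dict]) -> Counter:
    """Count personas per domain label, rejecting rows that lack a schema field."""
    counts: Counter = Counter()
    for row in rows:
        missing = [field for field in EXPECTED_FIELDS if field not in row]
        if missing:
            raise ValueError(f"row is missing fields: {missing}")
        counts[row["domain"]] += 1
    return counts
```

With a downloaded file (the name `indic_personahub_sample.parquet` is a placeholder), `count_by_domain(iter_persona_rows("indic_personahub_sample.parquet")).most_common(10)` would list the ten largest domain buckets. Streaming in batches is a design choice for a file of this size: 20 million long-form descriptions may not fit comfortably in memory on a typical workstation.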
Provider:
Harvard Dataverse
Created:
2026-05-07



