Indic-PersonaHub

Name: Indic-PersonaHub
Creator: Harvard Dataverse
Published: 2026-05-07 13:25:24
License: No description available

Source: DataCite Commons (updated 2026-05-07, indexed 2026-05-18)

Download link:

https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/D3SGJK

Resource description:

Dataset Description: Indic Persona Hub (Representative Sample Release)

The Indic Persona Hub is a large-scale, India-centric dataset of synthetic virtual personas created to support research in natural language processing (NLP), personalization, socially-aware machine learning, synthetic data generation, and dataset augmentation. This repository provides a representative sample release containing 20 million personas, while the complete dataset will be publicly released after paper acceptance.

The dataset contains 20,000,000 synthetic persona records spanning 90,633 domain buckets, covering a wide range of India-relevant themes such as education, business, healthcare, environment, technology, traditional knowledge systems, space, sustainability, social issues, and more. Each persona is a long-form, human-readable description that includes background, interests, values, and likely discussion topics.

Personas were generated using two complementary synthetic generation approaches: Text-to-Persona, where large language models infer plausible personas from India-centric web and open-source texts, and Persona-to-Persona, where additional personas are derived from existing ones using relationship and role transformations to improve representation of low-visibility groups and relational identities.

The dataset is released in Apache Parquet (.parquet) format for efficient storage and high-speed large-scale processing. The provided schema is compact and includes three fields: a unique identifier (id), a consolidated persona description (persona), and a domain label (domain). This structure is designed to simplify downstream tasks such as indexing, search, clustering, and embedding-based retrieval.

All personas are fully synthetic and do not correspond to real individuals. The dataset is designed to reflect India’s cultural and regional diversity while minimizing personally identifiable information. Users should be aware that synthetic generation may inherit biases from source texts and models, and responsible use is strongly recommended, particularly for high-stakes applications.
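
To make the three-field schema (id, persona, domain) concrete, here is a minimal Python sketch of how such records might be streamed and indexed. It is an illustration under stated assumptions, not the dataset authors' tooling: the file path passed to `iter_records` is hypothetical, the helper names are invented for this example, the sample rows are made up, and the type of `id` (shown here as a string) is not stated in the description. Reading Parquet assumes the third-party `pyarrow` package is installed; the other helpers use only the standard library.

```python
from collections import Counter
from typing import Iterable, Iterator, List

# Field names as listed in the dataset description.
FIELDS = ("id", "persona", "domain")


def iter_records(path: str, batch_size: int = 10_000) -> Iterator[dict]:
    """Stream rows from a Parquet file in batches instead of loading it whole.

    Assumes pyarrow is installed; `path` is a hypothetical local file name.
    """
    import pyarrow.parquet as pq  # lazy import so the helpers below work without it

    parquet_file = pq.ParquetFile(path)
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=list(FIELDS)):
        yield from batch.to_pylist()


def count_by_domain(records: Iterable[dict]) -> Counter:
    """Count how many personas fall under each domain label."""
    return Counter(record["domain"] for record in records)


def search_personas(records: Iterable[dict], keyword: str) -> List[dict]:
    """Return records whose persona text contains `keyword`, ignoring case."""
    needle = keyword.lower()
    return [record for record in records if needle in record["persona"].lower()]


# Made-up rows for illustration only; they are not taken from the dataset.
sample = [
    {"id": "p1", "persona": "A schoolteacher focused on rural education.", "domain": "education"},
    {"id": "p2", "persona": "A solar engineer working on village microgrids.", "domain": "sustainability"},
    {"id": "p3", "persona": "A university lecturer who studies Education policy.", "domain": "education"},
]

domain_counts = count_by_domain(sample)
education_hits = search_personas(sample, "education")
```

With a downloaded file, the same helpers could be applied to `iter_records("personas.parquet")` (a placeholder name) in place of `sample`. Streaming in batches is a reasonable choice at this scale, since 20 million long-form text records may not fit comfortably in memory.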

Provider:

Harvard Dataverse

Created:

2026-05-07
