
Synthetic-Indic-Multi-Lingual-Dataset-From-IndicPersonaHub

DataCite Commons. Updated 2026-05-07. Indexed 2026-05-18.
Download link:
https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/ZNCTGR
Description:
The Indic PersonaHub Domain QA Synthetic Corpus is a culturally grounded, large-scale synthetic instruction dataset generated with Indic PersonaHub, a large India-centric synthetic population designed to reflect the linguistic, cultural, occupational, and cognitive diversity of Indian society. The dataset is intended for training and evaluating large language models on India-specific reasoning, domain expertise, long-form instruction following, and multilingual alignment.

Each data sample is anchored to a unique synthetic persona that is explicitly tagged with a domain of expertise (e.g., education, healthcare, agriculture, governance, engineering, or traditional knowledge). Using this domain tag, the generation pipeline follows a persona-consistent two-step process:

1. Persona-Conditioned Question Generation: Each persona is prompted to generate a thought-provoking, domain-relevant question that someone with that background and expertise would realistically ask. The questions are designed to be non-trivial and often incorporate Indian societal constraints such as local infrastructure, cultural context, policy realities, and region-specific challenges.

2. Persona-Conditioned Long-Form Answering: The same persona then answers its own question with a detailed response of approximately 800–900 words, consistent with its expertise, lived experience, and sociocultural identity. The responses are structured, reasoning-heavy, and grounded in the Indian context, producing high-entropy text suitable for instruction tuning and long-form generation training.
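The two-step process described above can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' pipeline: the `Persona` fields, the prompt wording, the `llm` callable, and the word-count helper are all hypothetical names introduced here, and the only details taken from the description are the domain tag, the question-then-answer order, and the 800–900 word target.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Persona:
    description: str  # free-text persona profile (hypothetical field)
    domain: str       # domain-of-expertise tag, e.g. "healthcare"


def question_prompt(persona: Persona) -> str:
    # Step 1 prompt: ask the persona for a realistic, domain-relevant question.
    return (
        f"You are: {persona.description}\n"
        f"Domain: {persona.domain}\n"
        "Ask one thought-provoking, domain-relevant question that someone "
        "with this background would realistically ask."
    )


def answer_prompt(persona: Persona, question: str) -> str:
    # Step 2 prompt: the same persona answers its own question at length.
    return (
        f"You are: {persona.description}\n"
        f"Domain: {persona.domain}\n"
        f"Question: {question}\n"
        "Answer in roughly 800-900 words, staying consistent with this persona."
    )


def generate_sample(persona: Persona, llm: Callable[[str], str]) -> Dict[str, str]:
    # `llm` is any text-in, text-out callable (an assumption, not a real API).
    question = llm(question_prompt(persona)).strip()
    answer = llm(answer_prompt(persona, question)).strip()
    return {
        "persona": persona.description,
        "domain": persona.domain,
        "question": question,
        "answer": answer,
    }


def word_count_in_range(text: str, low: int = 800, high: int = 900) -> bool:
    # Whitespace-based word count; a rough check of the stated length target.
    return low <= len(text.split()) <= high
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client, so a stub function can stand in for it when testing the flow.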
The underlying Indic PersonaHub population was constructed using a dual strategy: Text-to-Persona, where LLMs infer personas from Indian web crawls and regional literature to capture long-tail diversity, and Persona-to-Persona, where additional personas are generated through relational expansion to improve representation of undercovered groups such as rural communities, informal workers, and elderly populations. Personas are filtered through demographic taxonomy coverage checks and semantic deduplication to ensure diversity and reduce redundancy.

To support multilingual training, generated outputs are passed through a hybrid translation pipeline to produce aligned versions across 22 Indian languages, enabling cross-lingual instruction tuning and culturally consistent multilingual modeling.

Quality is enforced using two automated LLM-based evaluators: a Cultural Compliance Judge, which checks that outputs align with Indian cultural norms and avoid stereotyping, and a Task Relevance Judge, which checks that the question and answer are meaningful, domain-consistent, and sufficiently detailed.

Overall, this dataset provides persona-grounded, domain-specific, long-form question–answer instruction data for building culturally aligned, multilingual language models for the Indian context.
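The persona deduplication and the two-judge quality gate described above could look something like the sketch below. It rests on several assumptions: the description does not say how semantic deduplication is implemented, so a simple lexical Jaccard overlap is swapped in here as a stand-in (a real pipeline would more likely compare embeddings), and the judge signatures, the function names, and the 0.8 threshold are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, List


def token_set(text: str) -> FrozenSet[str]:
    # Lowercased whitespace tokens; a crude stand-in for a semantic embedding.
    return frozenset(text.lower().split())


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    # Overlap of two token sets, from 0.0 (disjoint) to 1.0 (identical).
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(personas: Iterable[str], threshold: float = 0.8) -> List[str]:
    # Keep a persona only if it is not too similar to any persona kept so far.
    kept: List[str] = []
    kept_tokens: List[FrozenSet[str]] = []
    for persona in personas:
        tokens = token_set(persona)
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(persona)
            kept_tokens.append(tokens)
    return kept


# A judge takes (question, answer) and returns True if the sample passes.
# In the described pipeline each judge is an LLM-based evaluator; here it is
# just a callable (an assumption made for illustration).
Judge = Callable[[str, str], bool]


def passes_quality_gate(
    question: str,
    answer: str,
    cultural_judge: Judge,
    relevance_judge: Judge,
) -> bool:
    # A sample is kept only when both judges accept it.
    return cultural_judge(question, answer) and relevance_judge(question, answer)
```

Requiring both judges to pass is one plausible reading of "quality is enforced using two automated evaluators"; the description does not state how the two verdicts are combined.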
Provider:
Harvard Dataverse
Created:
2026-05-06