
nvidia/Nemotron-Personas-India

Source: Hugging Face. Updated 2025-12-16; indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/nvidia/Nemotron-Personas-India
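Resource overview:
Nemotron-Personas-India is an open-source (CC BY 4.0) synthetic persona dataset grounded in real-world demographic distributions in India, designed to capture the diversity and richness of the Indian population. It is a variant of [Nemotron-Personas](https://huggingface.co/datasets/nvidia/Nemotron-Personas) and the first Indic dataset aligned with Indian statistics, covering attributes such as name, sex, age, religion, language, background, marital status, education, and occupation. It provides high-quality personas in English and Hindi for a range of modeling use cases and supports Indian model builders in developing [Sovereign AI](https://www.nvidia.com/en-us/lp/industries/global-public-sector/sovereign-ai-technical-overview/) systems that incorporate region-specific demographic and cultural context. The dataset was generated with [NeMo Data Designer](https://docs.nvidia.com/nemo/microservices/latest/generate-synthetic-data/index.html), using a proprietary probabilistic graphical model (PGM) and the Apache-2.0-licensed GPT-OSS-120B model, with a growing set of built-in validators and evaluators. The dataset is suitable for commercial use.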

Nemotron-Personas-India is an open-source (CC BY 4.0) dataset of synthetically generated personas grounded in real-world demographic, geographic, and personality trait distributions in India, capturing the diversity and richness of the Indian population. It is a variant of [Nemotron-Personas](https://huggingface.co/datasets/nvidia/Nemotron-Personas) and the first Indic dataset aligned with statistics for names, sex, age, religion, spoken languages, background, marital status, education, and occupation. The dataset provides high-quality personas in both English and Hindi (Devanagari and Latin scripts) for various modeling use cases. It supports Indian model builders in developing Sovereign AI systems that incorporate region-specific demographics and cultural context, with the aim of improving diversity in synthetically generated data, mitigating biases, and preventing model collapse. The dataset is designed to be more representative of underlying demographic distributions along multiple axes, such as age, geography, spoken languages, education, occupation, and religious identities. Produced using NeMo Data Designer, it leverages a proprietary Probabilistic Graphical Model (PGM) and an Apache-2.0-licensed GPT-OSS-120B model, along with a growing set of validators and evaluators. The dataset is ready for commercial use.
Provider:
nvidia