
Nemotron-Personas

ModelScope community; updated 2026-01-06; indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/AI-ModelScope/Nemotron-Personas
Resource description:
Nemotron-Personas-USA
=========================================================================

<center>
  <img src="images/nemotron_persona_approach.png" alt="Nemotron-Personas-USA" width="400px">
  <p><em>A compound AI approach to personas grounded in real-world distributions</em></p>
</center>

# v1.1 Update

The v1.1 update introduces the following changes:
* leverage `openai/gpt-oss-120b` model instead of `mistralai/Mixtral-8x22B-v0.1` model to improve data quality and diversity
* increase the number of records from 100k to 1M, for a total of 0.94B tokens
* update the dataset name to Nemotron-Personas-USA in order to differentiate it from other region-specific datasets in the [Nemotron-Personas collection](https://huggingface.co/collections/nvidia/nemotron-personas).

# Dataset Overview

Nemotron-Personas-USA is an open-source (CC BY 4.0) dataset of synthetically-generated personas grounded in real-world demographic, geographic and personality trait distributions to capture the diversity and richness of the population. It is the first dataset of its kind aligned with statistics for names, sex, age, background, marital status, education, occupation and location, among other attributes. With an initial release focused on the United States, this dataset provides high-quality personas for a variety of modeling use-cases.

The dataset can be used to improve diversity of synthetically-generated data, mitigate data/model biases, and prevent model collapse. In particular, the dataset is designed to be more representative of underlying demographic distributions along multiple axes, including age (e.g. older personas), geography (e.g., rural personas), education, occupation and ethnicity, as compared to past persona datasets.

Produced using [NeMo Data Designer](https://docs.nvidia.com/nemo/microservices/latest/generate-synthetic-data/index.html), an enterprise-grade compound AI system for synthetic data generation, the dataset leverages a proprietary Probabilistic Graphical Model (PGM) along with an Apache-2.0-licensed `openai/gpt-oss-120b` model and an ever-expanding set of validators and evaluators built into Data Designer. An extended version of Nemotron-Personas-USA is available for use in NeMo Data Designer itself.

This dataset is ready for commercial/non-commercial use.

## What is NOT in the dataset

Given the emphasis on personas, the dataset excludes other fields available in Data Designer, e.g., first/middle/last names and synthetic addresses. Also excluded are personas generally of relevance to enterprise clients (e.g., finance, healthcare). Please [reach out](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/contact-sales/) to explore enterprise use-cases.

All data, while mirroring real-world distributions, is completely artificially generated. Any similarity in names or persona descriptions to actual persons, living or dead, is purely coincidental.

# Data Developer

NVIDIA Corporation

# Release Date

Hugging Face 06/09/2025 via https://huggingface.co/datasets/nvidia/Nemotron-Personas

# Dataset Creation Date

06/09/2025

# License/Terms of Use

This dataset is licensed under the [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/deed.en) (CC BY 4.0).

# Use Case

Developers working on Sovereign AI, training LLMs, and/or looking to improve diversity of synthetically generated data, mitigate data/model biases, and prevent model collapse.

# Data Version

* 1.0 (06/09/2025)
* 1.1 (10/28/2025)

# Intended use

The Nemotron-Personas-USA dataset is intended to be used by the community to continue to improve open models and push the state of the art. The data may be freely used to train any model.
We welcome feedback from the open-source community and invite developers, researchers, and data enthusiasts to explore the dataset and build upon it.

The Nemotron-Personas-USA dataset is grounded in distributions of self-reported demographic data in the US Census. As such, its primary goal is to combat missing data and/or potential biases present in model training data today, especially when it comes to existing persona datasets used in synthetic data generation. Despite the improved data diversity and fidelity to the US population, we are still limited by data availability and reasonable model complexity. This results in some necessary independence assumptions; for instance, that occupations are independent of location (zip code) given education, age and sex. Similarly, comprehensive statistics on gender, independent of sex, are not available from the Census Bureau. We leave further efforts to improve fidelity to future work.

Note that the dataset is focused on adults only.

# Dataset Details

The dataset contains:
* 6M personas across 1M records (6 persona fields & 16 contextual fields)
* ~936M tokens, including ~371M persona tokens
* 29k geographic areas (ZCTAs) and 15.2k cities across all 50 states + Puerto Rico and Virgin Islands
* 970k unique full names
* 560+ professional occupations, all grounded in real-world distributions
* Comprehensive coverage across demographic and personality trait distributions

## Seed Data

In order to capture the socio-demographic and geographic diversity and complexity of the US population, Nemotron-Personas-USA leveraged open-source ([CC0-licensed](https://creativecommons.org/public-domain/cc0/)) aggregated statistical data from
* The US Census Bureau, specifically the [American Community Survey](https://catalog.data.gov/dataset/american-community-survey-5-year-estimates-data-profiles-5-year).
* The study “Race and ethnicity data for first, middle, and surnames,” [Rosenman et al. (2023)](https://www.nature.com/articles/s41597-023-02202-2); specifically, the dataset located [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SGKW0K).

## Schema

The dataset includes 22 fields: 6 persona fields and 16 contextual fields shown below. Researchers will find many contextual fields useful in zeroing in on specific personas, which is challenging to do with existing datasets.

<center>
  <img src="images/nemotron_personas_schema.png" width="700px">
</center>

## Field & Token Counts

0.94B tokens across 1M records and 22 columns, excluding the globally unique identifier. Note that the data covers all 50 states as well as Puerto Rico and Virgin Islands.

<center>
  <img src="images/nemotron_personas_field_stats.png" width="500px">
</center>

# Dataset Description & Quality Assessment

The analysis below provides a breakdown across various axes of the dataset to emphasize the built-in diversity and pattern complexity of the data.

## Names

Since the focus of this dataset is on personas, names aren’t provided as dedicated fields. However, infused into persona prompts are 136,000 unique first names, 126,000 unique middle names, and 338,000 unique surnames sourced from [Rosenman et al. (2023)](https://www.nature.com/articles/s41597-023-02202-2).

## Age distribution

The distribution of our persona ages takes the form of a bulging population pyramid that reflects historical birth rates, mortality trends, and migration patterns. This is in stark contrast to a bell curve distribution typically produced by an LLM alone. Overall the distribution is right-skewed and distinctly non-Gaussian. Note that minors are excluded from this dataset (see the Ethics section below).

<center>
  <img src="images/nemotron_personas_age_group_distribution.png" width="600px">
</center>

## Marital Status by Age Group

The heatmap below displays the fraction of people for each age cohort who are (1) never married, (2) currently married, (3) separated, (4) divorced, or (5) widowed.
It highlights how marital status shifts over the life course in the US, with “never married” dominating the late teens and early twenties, “married” climbing rapidly through the twenties and peaking in the mid-forties, and “divorced” and “widowed” becoming much more pronounced in later stages of life. All of these considerations are relevant to informing life experiences and personas.

<center>
  <img src="images/nemotron_personas_marital_status_distribution.png" width="600px">
</center>

## Education Level by Age Group

The heatmap below captures intricate patterns of educational attainment across age cohorts. For example, note how the share of high-school-only and no-diploma individuals ebbs then resurges among the oldest age groups, reflecting historical shifts in access and in social norms.

<center>
  <img src="images/nemotron_personas_education_distribution.png" width="600px">
</center>

## Geographic Intricacies of Education Attainment

This slice of our dataset demonstrates how geography informs education and therefore persona descriptions. The choropleth map shows, for each U.S. state, the share of residents age 25 and older who hold at least a bachelor’s degree. No LLM in our testing was able to generate data of this fidelity.

<center>
  <img src="images/nemotron_personas_education_map.png" width="700px">
  <p><em>Left: Nemotron-Personas-USA dataset. Right: <a href="https://en.wikipedia.org/wiki/Educational_attainment_in_the_United_States">Educational attainment in the United States, Wikipedia</a></em></p>
</center>

## Occupational Categories

The treemap below reflects the richness of our dataset with respect to professional occupations of personas. Represented in our dataset are over 560 occupation categories that are further informed by demographic and geographic distributions.
<center>
  <img src="images/nemotron_personas_occupation_tree_map.png" width="600px">
</center>

## Persona diversity

The attributes above (and many more) ultimately affect the diversity of the synthetic personas being generated. As an example, the analysis below highlights a multitude of clusters within professional persona descriptions. These clusters are identified by clustering embeddings and reducing dimensionality to 2D.

<center>
  <img src="images/nemotron_personas_professional_personas_clustering.png" width="600px">
</center>

# How to use it

You can load the dataset with the following lines of code.

```python
from datasets import load_dataset

nemotron_personas = load_dataset("nvidia/Nemotron-Personas-USA")
```

# Dataset Characterization

## Data Collection Method

* Hybrid: Human, Synthetic, Automated

## Labeling Method

* Not Applicable

## Dataset Format

* Text

## Dataset Quantification

* Record counts: 1M records (6M persona descriptions)
* Total data storage: 2.6GB

# Ethical Considerations:

NVIDIA believes [Trustworthy AI](https://www.nvidia.com/en-us/ai-data-science/trustworthy-ai/) is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal teams to ensure this dataset meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

# Citation

If you find the data useful, please cite:

```
@software{nvidia/Nemotron-Personas-USA,
  author = {Meyer, Yev and Corneil, Dane},
  title = {{Nemotron-Personas-USA}: Synthetic Personas Aligned to Real-World Distributions },
  month = {June},
  year = {2025},
  url = {https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA}
}
```
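
As a follow-up to the loading snippet in "How to use it", the sketch below illustrates how the contextual fields might be used to narrow down to specific personas and to check a simple distribution. It is a minimal sketch and not part of the dataset card: the field names `age`, `state`, and `marital_status` are assumptions (the schema is only shown as an image above), the helper names `filter_personas` and `share_by` are hypothetical, and the toy rows are made-up placeholders rather than real records.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Iterator, Mapping, Optional


def filter_personas(
    rows: Iterable[Mapping[str, Any]],
    *,
    state: Optional[str] = None,
    min_age: Optional[int] = None,
    max_age: Optional[int] = None,
) -> Iterator[Mapping[str, Any]]:
    """Yield rows that match the given contextual-field constraints.

    Assumes each row is a mapping with `state` and `age` keys
    (assumed field names, not confirmed by the dataset card text).
    """
    for row in rows:
        if state is not None and row.get("state") != state:
            continue
        age = row.get("age")
        if min_age is not None and (age is None or age < min_age):
            continue
        if max_age is not None and (age is None or age > max_age):
            continue
        yield row


def share_by(rows: Iterable[Mapping[str, Any]], field: str) -> Dict[Any, float]:
    """Return the fraction of rows taking each value of `field`."""
    counts = Counter(row.get(field) for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


# Made-up placeholder rows for illustration only; not taken from the dataset.
toy_rows = [
    {"age": 72, "state": "MT", "marital_status": "widowed"},
    {"age": 34, "state": "MT", "marital_status": "married"},
    {"age": 68, "state": "TX", "marital_status": "married"},
    {"age": 25, "state": "TX", "marital_status": "never_married"},
]

# Older personas in one state, then the state mix of the whole toy sample.
older_mt = list(filter_personas(toy_rows, state="MT", min_age=65))
print(len(older_mt), "matching toy rows")
print(share_by(toy_rows, "state"))
```

With the real data, the same helpers could presumably be applied to the rows of a loaded split, for example `nemotron_personas["train"]` if the split is named `train` (an assumption); for 1M records, the `datasets` library's own `Dataset.filter` method is likely the more efficient choice.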

Provider:
maas
Created:
2025-06-11