zhenjuan/Nemotron-Personas-USA
Source: Hugging Face. Updated 2025-12-27; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/zhenjuan/Nemotron-Personas-USA
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
language:
- en
tags:
- synthetic
- personas
- NVIDIA
- datadesigner
size_categories:
- 1M<n<10M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: uuid
    dtype: string
  - name: professional_persona
    dtype: string
  - name: sports_persona
    dtype: string
  - name: arts_persona
    dtype: string
  - name: travel_persona
    dtype: string
  - name: culinary_persona
    dtype: string
  - name: persona
    dtype: string
  - name: cultural_background
    dtype: string
  - name: skills_and_expertise
    dtype: string
  - name: skills_and_expertise_list
    dtype: string
  - name: hobbies_and_interests
    dtype: string
  - name: hobbies_and_interests_list
    dtype: string
  - name: career_goals_and_ambitions
    dtype: string
  - name: sex
    dtype: string
  - name: age
    dtype: int64
  - name: marital_status
    dtype: string
  - name: education_level
    dtype: string
  - name: bachelors_field
    dtype: string
  - name: occupation
    dtype: string
  - name: city
    dtype: string
  - name: state
    dtype: string
  - name: zipcode
    dtype: string
  - name: country
    dtype: string
  splits:
  - name: train
    num_bytes: 5328684597
    num_examples: 1000000
  download_size: 2686692730
  dataset_size: 5328684597
---
Nemotron-Personas-USA
=========================================================================
<center>
<img src="images/nemotron_persona_approach.png" alt="Nemotron-Personas-USA" width="400px">
<p><em>A compound AI approach to personas grounded in real-world distributions</em></p>
</center>
# v1.1 Update
The v1.1 update introduces the following changes:
* Uses the `openai/gpt-oss-120b` model instead of the `mistralai/Mixtral-8x22B-v0.1` model to improve data quality and diversity
* Increases the number of records from 100k to 1M, for a total of 0.94B tokens
* Renames the dataset to Nemotron-Personas-USA to differentiate it from other region-specific datasets in the [Nemotron-Personas collection](https://huggingface.co/collections/nvidia/nemotron-personas)
# Dataset Overview
Nemotron-Personas-USA is an open-source (CC BY 4.0) dataset of synthetically generated personas grounded in real-world demographic, geographic, and personality trait distributions to capture the diversity and richness of the population. It is the first dataset of its kind aligned with statistics for names, sex, age, background, marital status, education, occupation, and location, among other attributes. With an initial release focused on the United States, this dataset provides high-quality personas for a variety of modeling use cases.
The dataset can be used to improve the diversity of synthetically generated data, mitigate data/model biases, and prevent model collapse. In particular, the dataset is designed to be more representative of underlying demographic distributions along multiple axes, including age (e.g., older personas), geography (e.g., rural personas), education, occupation, and ethnicity, as compared to past persona datasets.
Produced using [NeMo Data Designer](https://docs.nvidia.com/nemo/microservices/latest/generate-synthetic-data/index.html), an enterprise-grade compound AI system for synthetic data generation, the dataset leverages a proprietary Probabilistic Graphical Model (PGM) along with an Apache-2.0-licensed `openai/gpt-oss-120b` model and an ever-expanding set of validators and evaluators built into Data Designer. An extended version of Nemotron-Personas-USA is available for use in NeMo Data Designer itself.
This dataset is ready for commercial/non-commercial use.
## What is NOT in the dataset
Given the emphasis on personas, the dataset excludes other fields available in Data Designer, e.g., first/middle/last names and synthetic addresses. Also excluded are personas generally of relevance to enterprise clients (e.g., finance, healthcare). Please [reach out](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/contact-sales/) to explore enterprise use-cases.
All data, while mirroring real-world distributions, is completely artificially generated. Any similarity in names or persona descriptions to actual persons, living or dead, is purely coincidental.
# Data Developer
NVIDIA Corporation
# Release Date
Hugging Face 06/09/2025 via https://huggingface.co/datasets/nvidia/Nemotron-Personas
# Dataset Creation Date
06/09/2025
# License/Terms of Use
This dataset is licensed under the [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/deed.en) (CC BY 4.0).
# Use Case
Developers working on Sovereign AI, training LLMs, and/or looking to improve diversity of synthetically generated data, mitigate data/model biases, and prevent model collapse.
# Data Version
1.0 (06/09/2025)
1.1 (10/28/2025)
# Intended use
The Nemotron-Personas-USA dataset is intended to be used by the community to continue to improve open models and push the state of the art. The data may be freely used to train any model. We welcome feedback from the open-source community and invite developers, researchers, and data enthusiasts to explore the dataset and build upon it.
The Nemotron-Personas-USA dataset is grounded in distributions of self-reported demographic data in the US Census. As such, its primary goal is to combat missing data and/or potential biases present in model training data today, especially when it comes to existing persona datasets used in synthetic data generation. Despite the improved data diversity and fidelity to the US population, we are still limited by data availability and reasonable model complexity. This results in some necessary independence assumptions; for instance, that occupations are independent of location (zip code) given education, age and sex. Similarly, comprehensive statistics on gender, independent of sex, are not available from the Census Bureau. We leave further efforts to improve fidelity to future work.
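The independence assumption mentioned above can be written out explicitly. As a sketch, with the symbols $O$ (occupation), $Z$ (zip code), $E$ (education), $A$ (age) and $S$ (sex) introduced here purely for illustration (they are not notation used by the dataset authors), the statement that occupation is independent of location given education, age and sex reads:

```latex
P(O \mid Z, E, A, S) = P(O \mid E, A, S)
\quad\Longleftrightarrow\quad
P(O, Z \mid E, A, S) = P(O \mid E, A, S)\, P(Z \mid E, A, S)
```

In words: once education, age and sex are fixed, the zip code carries no additional information about occupation under this assumption.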
Note that the dataset is focused on adults only.
# Dataset Details
The dataset contains:
* 6M personas across 1M records (6 persona fields & 16 contextual fields)
* ~936M tokens, including ~371M persona tokens
* 29k geographic areas (ZCTAs) and 15.2k cities across all 50 states, plus Puerto Rico and the U.S. Virgin Islands
* 970k unique full names
* 560+ professional occupations, all grounded in real-world distributions
* Comprehensive coverage across demographic and personality trait distributions
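A few derived quantities follow directly from the headline figures in the list above. The sketch below is only an arithmetic check that uses the rounded counts as quoted, so the results are approximate; the variable names are chosen here for illustration.

```python
# Rounded figures quoted in the list above.
records = 1_000_000
persona_fields = 6
total_tokens = 936_000_000
persona_tokens = 371_000_000

# Derived quantities (approximate, because the inputs are rounded).
total_personas = records * persona_fields             # persona descriptions overall
tokens_per_record = total_tokens / records            # average tokens per record
tokens_per_persona = persona_tokens / total_personas  # average tokens per persona text
persona_share = persona_tokens / total_tokens         # fraction of tokens in persona fields

print(f"{total_personas:,} personas")
print(f"{tokens_per_record:.0f} tokens per record on average")
print(f"{tokens_per_persona:.1f} tokens per persona description on average")
print(f"{persona_share:.1%} of all tokens are persona text")
```

This reproduces the "6M personas across 1M records" figure and suggests that the six persona fields account for roughly two fifths of the token volume, with the contextual fields making up the rest.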
## Seed Data
In order to capture the socio-demographic and geographic diversity and complexity of the US population, Nemotron-Personas-USA leveraged open-source ([CC0-licensed](https://creativecommons.org/public-domain/cc0/)) aggregated statistical data from
* The US Census Bureau, specifically the [American Community Survey](https://catalog.data.gov/dataset/american-community-survey-5-year-estimates-data-profiles-5-year).
* The study “Race and ethnicity data for first, middle, and surnames,” [Rosenman et al. (2023)](https://www.nature.com/articles/s41597-023-02202-2); specifically, the dataset located [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SGKW0K).
## Schema
The dataset includes 22 fields: 6 persona fields and 16 contextual fields, shown below. Researchers will find many of the contextual fields useful for zeroing in on specific personas, which is challenging to do with existing datasets.
<center>
<img src="images/nemotron_personas_schema.png" width="700px">
</center>
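To illustrate how the contextual fields can be used to narrow down to specific personas, here is a minimal sketch of a filter over records represented as plain dictionaries. The field names (`state`, `age`, `occupation`) come from the schema; the helper `select_personas` and the sample rows are hypothetical, and the way values such as state or occupation are encoded in the real data is an assumption.

```python
from typing import Any, Dict, Iterable, Iterator, Optional

Record = Dict[str, Any]


def select_personas(
    records: Iterable[Record],
    state: Optional[str] = None,
    min_age: Optional[int] = None,
    max_age: Optional[int] = None,
    occupation_contains: Optional[str] = None,
) -> Iterator[Record]:
    """Yield records whose contextual fields match every criterion that is set."""
    for rec in records:
        if state is not None and rec["state"] != state:
            continue
        if min_age is not None and rec["age"] < min_age:
            continue
        if max_age is not None and rec["age"] > max_age:
            continue
        if occupation_contains is not None and (
            occupation_contains.lower() not in rec["occupation"].lower()
        ):
            continue
        yield rec


# Hypothetical rows for illustration only; real values may be encoded differently.
sample = [
    {"uuid": "a", "state": "MT", "age": 71, "occupation": "farmer"},
    {"uuid": "b", "state": "NY", "age": 29, "occupation": "software_developer"},
    {"uuid": "c", "state": "MT", "age": 34, "occupation": "registered_nurse"},
]

older_montanans = list(select_personas(sample, state="MT", min_age=65))
print([r["uuid"] for r in older_montanans])
```

When working with the Hugging Face `datasets` library, an equivalent predicate can be passed to `Dataset.filter` instead of iterating in Python.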
## Field & Token Counts
The dataset contains 0.94B tokens across 1M records and 22 columns, excluding the globally unique identifier. Note that the data covers all 50 states as well as Puerto Rico and the U.S. Virgin Islands.
<center>
<img src="images/nemotron_personas_field_stats.png" width="500px">
</center>
# Dataset Description & Quality Assessment
The analysis below provides a breakdown across various axes of the dataset to emphasize the built-in diversity and pattern complexity of data.
## Names
Since the focus of this dataset is on personas, names aren’t provided as dedicated fields. However, infused into persona prompts are 136,000 unique first names, 126,000 unique middle names, and 338,000 unique surnames sourced from [Rosenman et al. (2023)](https://www.nature.com/articles/s41597-023-02202-2).
## Age distribution
The distribution of our persona ages takes the form of a bulging population pyramid that reflects historical birth rates, mortality trends, and migration patterns. This is in stark contrast to a bell curve distribution typically produced by an LLM alone. Overall the distribution is right-skewed and distinctly non-Gaussian. Note that minors are excluded from this dataset (see the Ethics section below).
<center>
<img src="images/nemotron_personas_age_group_distribution.png" width="600px">
</center>
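A distribution like the one in the figure can be approximated from the `age` column by binning ages into cohorts. The sketch below is one way to do that; the 10-year bin width and the starting age of 18 are assumptions made for illustration and are not necessarily the bins used in the figure.

```python
from collections import Counter
from typing import Dict, Iterable


def age_group(age: int, width: int = 10, start: int = 18) -> str:
    """Map an age (assumed >= start) to a cohort label such as '18-27'."""
    low = start + ((age - start) // width) * width
    return f"{low}-{low + width - 1}"


def age_group_shares(
    ages: Iterable[int], width: int = 10, start: int = 18
) -> Dict[str, float]:
    """Return the share of records per age cohort, ordered from youngest to oldest."""
    counts = Counter(age_group(a, width, start) for a in ages)
    total = sum(counts.values())
    ordered = sorted(counts, key=lambda label: int(label.split("-")[0]))
    return {label: counts[label] / total for label in ordered}


# Hypothetical ages for illustration only.
sample_ages = [18, 19, 25, 33, 41, 67, 72, 90]
shares = age_group_shares(sample_ages)
for label, share in shares.items():
    print(f"{label}: {share:.3f}")
```

Plotting these shares as a bar chart for the full `age` column gives a quick check on the shape of the distribution.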
## Marital Status by Age Group
The heatmap below displays the fraction of people in each age cohort who are (1) never married, (2) currently married, (3) separated, (4) divorced, or (5) widowed. It highlights how marital status shifts over the life course in the US: “never married” dominates the late teens and early twenties, “married” climbs rapidly through the twenties and peaks in the mid-forties, and divorced and widowed become much more pronounced in later stages of life. All of these patterns are relevant to shaping life experiences and personas.
<center>
<img src="images/nemotron_personas_marital_status_distribution.png" width="600px">
</center>
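The quantity shown in the heatmap, the fraction of each age cohort in each marital status, is a row-normalized cross-tabulation of `age` and `marital_status`. A minimal sketch follows; grouping by decade is an assumption (the cohorts in the figure may differ), and the status strings in the sample rows are hypothetical placeholders rather than the dataset's actual category labels.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable


def marital_status_by_decade(
    records: Iterable[Dict[str, Any]]
) -> Dict[str, Dict[str, float]]:
    """For each age decade, return the fraction of records in each marital status."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for rec in records:
        decade = f"{(rec['age'] // 10) * 10}s"
        counts[decade][rec["marital_status"]] += 1

    table: Dict[str, Dict[str, float]] = {}
    for decade in sorted(counts, key=lambda d: int(d[:-1])):
        total = sum(counts[decade].values())
        # Normalize within the cohort so each row sums to 1.
        table[decade] = {s: n / total for s, n in counts[decade].items()}
    return table


# Hypothetical rows; the real category labels may be spelled differently.
sample = [
    {"age": 22, "marital_status": "never_married"},
    {"age": 24, "marital_status": "never_married"},
    {"age": 27, "marital_status": "married"},
    {"age": 45, "marital_status": "married"},
    {"age": 48, "marital_status": "divorced"},
    {"age": 81, "marital_status": "widowed"},
]

table = marital_status_by_decade(sample)
for decade, row in table.items():
    print(decade, {status: round(frac, 2) for status, frac in row.items()})
```

Normalizing within each cohort, rather than over the whole dataset, is what makes rows comparable even though cohorts differ in size.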
## Education Level by Age Group
The heatmap below captures intricate patterns of educational attainment across age cohorts. For example, note how the share of high-school-only and no-diploma individuals ebbs then resurges among the oldest age groups, reflecting historical shifts in access and in social norms.
<center>
<img src="images/nemotron_personas_education_distribution.png" width="600px">
</center>
## Geographic Intricacies of Education Attainment
This slice of our dataset demonstrates how geography informs education and therefore persona descriptions. The choropleth map shows, for each U.S. state, the share of residents age 25 and older who hold at least a bachelor’s degree. No LLM in our testing was able to generate data of this fidelity.
<center>
<img src="images/nemotron_personas_education_map.png" width="700px">
<p><em>Left: Nemotron-Personas-USA dataset. Right: <a href="https://en.wikipedia.org/wiki/Educational_attainment_in_the_United_States">Educational attainment in the United States, Wikipedia</a></em></p>
</center>
## Occupational Categories
The treemap below reflects the richness of our dataset with respect to professional occupations of personas. Represented in our dataset are over 560 occupation categories that are further informed by demographic and geographic distributions.
<center>
<img src="images/nemotron_personas_occupation_tree_map.png" width="600px">
</center>
## Persona diversity
The attributes above (and many more) ultimately affect the diversity of the synthetic personas being generated. As an example, the analysis below highlights a multitude of clusters within professional persona descriptions. These clusters are identified by clustering embeddings and reducing dimensionality to 2D.
<center>
<img src="images/nemotron_personas_professional_personas_clustering.png" width="600px">
</center>
# How to use it
You can load the dataset with the following lines of code.
```python
from datasets import load_dataset
nemotron_personas = load_dataset("nvidia/Nemotron-Personas-USA")
```
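Because the full download is several gigabytes, it can be convenient to stream a few records before committing to it. The sketch below assumes the `datasets` library's streaming mode and network access for `stream_personas`. It also includes a helper for the `skills_and_expertise_list` and `hobbies_and_interests_list` columns, which the card metadata types as strings; that they hold a Python-style list literal is an assumption made here, so the parser falls back to returning the raw string as a single item. The helper names are hypothetical.

```python
import ast
from itertools import islice
from typing import Any, Dict, Iterable, List


def parse_list_field(value: str) -> List[str]:
    """Parse a stringified list; fall back to a one-item list if parsing fails."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a Python literal: treat the whole string as one entry.
        return [value]
    if isinstance(parsed, (list, tuple)):
        return [str(item) for item in parsed]
    return [str(parsed)]


def peek(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n records of any iterable without reading the rest."""
    return list(islice(records, n))


def stream_personas():
    """Open the train split in streaming mode (needs network access and `datasets`)."""
    from datasets import load_dataset

    return load_dataset("nvidia/Nemotron-Personas-USA", split="train", streaming=True)


# Offline demonstration with hypothetical values.
print(parse_list_field("['gardening', 'chess']"))
print(parse_list_field("gardening and chess"))
```

With network access, `peek(stream_personas())` would return the first three records, and `parse_list_field(row["skills_and_expertise_list"])` could then be applied to each of them.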
# Dataset Characterization
## Data Collection Method
* Hybrid: Human, Synthetic, Automated
## Labeling Method
* Not Applicable
## Dataset Format
* Text
## Dataset Quantification
* Record counts: 1M records (6M persona descriptions)
* Total data storage: 2.6GB
# Ethical Considerations:
NVIDIA believes [Trustworthy AI](https://www.nvidia.com/en-us/ai-data-science/trustworthy-ai/) is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using this dataset in accordance with our terms of service, developers should work with their internal teams to ensure it meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
# Citation
If you find the data useful, please cite:
```
@software{nvidia/Nemotron-Personas-USA,
  author = {Meyer, Yev and Corneil, Dane},
  title = {{Nemotron-Personas-USA}: Synthetic Personas Aligned to Real-World Distributions},
  month = {June},
  year = {2025},
  url = {https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA}
}
```
Provider:
zhenjuan



