
Nemotron-Personas-USA

Source: ModelScope community. Updated 2025-11-27; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/nv-community/Nemotron-Personas-USA
Resource description:
Nemotron-Personas-USA
=========================================================================

<center>
<img src="images/nemotron_persona_approach.png" alt="Nemotron-Personas-USA" width="400px">
<p><em>A compound AI approach to personas grounded in real-world distributions</em></p>
</center>

# v1.1 Update

The v1.1 update introduces the following changes:

* leverage `openai/gpt-oss-120b` model instead of `mistralai/Mixtral-8x22B-v0.1` model to improve data quality and diversity
* increase the number of records from 100k to 1M, for a total of 0.94B tokens
* update the dataset name to Nemotron-Personas-USA in order to differentiate it from other region-specific datasets in the [Nemotron-Personas collection](https://huggingface.co/collections/nvidia/nemotron-personas).

# Dataset Overview

Nemotron-Personas-USA is an open-source (CC BY 4.0) dataset of synthetically-generated personas grounded in real-world demographic, geographic and personality trait distributions to capture the diversity and richness of the population. It is the first dataset of its kind aligned with statistics for names, sex, age, background, marital status, education, occupation and location, among other attributes. With an initial release focused on the United States, this dataset provides high-quality personas for a variety of modeling use-cases.

The dataset can be used to improve diversity of synthetically-generated data, mitigate data/model biases, and prevent model collapse. In particular, the dataset is designed to be more representative of underlying demographic distributions along multiple axes, including age (e.g. older personas), geography (e.g., rural personas), education, occupation and ethnicity, as compared to past persona datasets.
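
As a rough illustration of the diversity use case described above, the sketch below conditions a generation prompt on a sampled persona. It assumes each record is a dictionary with a free-text `persona` field. That field name, the sample records, and the `build_prompt` and `persona_prompts` helpers are hypothetical and not taken from this card, so check the published schema before relying on them.

```python
import random

# Hypothetical stand-ins for dataset records; the real schema may differ.
SAMPLE_RECORDS = [
    {"persona": "A retired rural schoolteacher who volunteers at the county library."},
    {"persona": "A night-shift nurse in a mid-sized city who is studying for a master's degree."},
    {"persona": "A long-haul truck driver who restores vintage radios on weekends."},
]


def build_prompt(record: dict, task: str, persona_field: str = "persona") -> str:
    """Prefix a task instruction with the persona text from one record."""
    persona = record[persona_field].strip()
    return f"Persona: {persona}\nTask: {task}"


def persona_prompts(records: list, task: str, k: int, seed: int = 0) -> list:
    """Sample k distinct records and build one persona-conditioned prompt per record."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    chosen = rng.sample(records, k)  # sampling without replacement
    return [build_prompt(record, task) for record in chosen]


if __name__ == "__main__":
    for prompt in persona_prompts(SAMPLE_RECORDS, "Write a short product review.", k=2):
        print(prompt, end="\n\n")
```

Sampling without replacement keeps each prompt tied to a different persona, which is the point of using a persona pool to vary otherwise identical instructions.
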

Produced using [NeMo Data Designer](https://docs.nvidia.com/nemo/microservices/latest/generate-synthetic-data/index.html), an enterprise-grade compound AI system for synthetic data generation, the dataset leverages a proprietary Probabilistic Graphical Model (PGM) along with an Apache-2.0-licensed `openai/gpt-oss-120b` model and an ever-expanding set of validators and evaluators built into Data Designer. An extended version of Nemotron-Personas-USA is available for use in NeMo Data Designer itself.

This dataset is ready for commercial/non-commercial use.

## What is NOT in the dataset

Given the emphasis on personas, the dataset excludes other fields available in Data Designer, e.g., first/middle/last names and synthetic addresses. Also excluded are personas generally of relevance to enterprise clients (e.g., finance, healthcare). Please [reach out](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/contact-sales/) to explore enterprise use-cases.

All data, while mirroring real-world distributions, is completely artificially generated. Any similarity in names or persona descriptions to actual persons, living or dead, is purely coincidental.

# Data Developer

NVIDIA Corporation

# Release Date

Hugging Face 06/09/2025 via https://huggingface.co/datasets/nvidia/Nemotron-Personas

# Dataset Creation Date

06/09/2025

# License/Terms of Use

This dataset is licensed under the [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/deed.en) (CC BY 4.0).

# Use Case

Developers working on Sovereign AI, training LLMs, and/or looking to improve diversity of synthetically generated data, mitigate data/model biases, and prevent model collapse.

# Data Version

1.0 (06/09/2025)

1.1 (10/28/2025)

# Intended use

The Nemotron-Personas-USA dataset is intended to be used by the community to continue to improve open models and push the state of the art. The data may be freely used to train any model.
We welcome feedback from the open-source community and invite developers, researchers, and data enthusiasts to explore the dataset and build upon it.

The Nemotron-Personas-USA dataset is grounded in distributions of self-reported demographic data in the US Census. As such, its primary goal is to combat missing data and/or potential biases present in model training data today, especially when it comes to existing persona datasets used in synthetic data generation. Despite the improved data diversity and fidelity to the US population, we are still limited by data availability and reasonable model complexity. This results in some necessary independence assumptions; for instance, that occupations are independent of location (zip code) given education, age and sex. Similarly, comprehensive statistics on gender, independent of sex, are not available from the Census Bureau. We leave further efforts to improve fidelity to future work.

Note that the dataset is focused on adults only.

# Dataset Details

The dataset contains:

* 6M personas across 1M records (6 persona fields & 16 contextual fields)
* ~936M tokens, including ~371M persona tokens
* 29k geographic areas (ZCTAs) and 15.2k cities across all 50 states + Puerto Rico and Virgin Islands
* 970k unique full names
* 560+ professional occupations, all grounded in real-world distributions
* Comprehensive coverage across demographic and personality trait distributions

## Seed Data

In order to capture the socio-demographic and geographic diversity and complexity of the US population, Nemotron-Personas-USA leveraged open-source ([CC0-licensed](https://creativecommons.org/public-domain/cc0/)) aggregated statistical data from

* The US Census Bureau, specifically the [American Community Survey](https://catalog.data.gov/dataset/american-community-survey-5-year-estimates-data-profiles-5-year).
* The study “Race and ethnicity data for first, middle, and surnames,” [Rosenman et al. (2023)](https://www.nature.com/articles/s41597-023-02202-2); specifically, the dataset located [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SGKW0K).

## Schema

The dataset includes 22 fields: 6 persona fields and 16 contextual fields shown below. Researchers will find many contextual fields useful in zeroing in on specific personas, which is challenging to do with existing datasets.

<center>
<img src="images/nemotron_personas_schema.png" width="700px">
</center>

## Field & Token Counts

0.94B tokens across 1M records and 22 columns, excluding the globally unique identifier. Note that data covers 50 states as well as Puerto Rico and Virgin Islands.

<center>
<img src="images/nemotron_personas_field_stats.png" width="500px">
</center>

# Dataset Description & Quality Assessment

The analysis below provides a breakdown across various axes of the dataset to emphasize the built-in diversity and pattern complexity of data.

## Names

Since the focus of this dataset is on personas, names aren’t provided as dedicated fields. However, infused into persona prompts are 136,000 unique first names, 126,000 unique middle names, and 338,000 unique surnames sourced from [Rosenman et al. (2023)](https://www.nature.com/articles/s41597-023-02202-2).

## Age distribution

The distribution of our persona ages takes the form of a bulging population pyramid that reflects historical birth rates, mortality trends, and migration patterns. This is in stark contrast to a bell curve distribution typically produced by an LLM alone. Overall the distribution is right-skewed and distinctly non-Gaussian. Note that minors are excluded from this dataset (see the Ethics section below).

<center>
<img src="images/nemotron_personas_age_group_distribution.png" width="600px">
</center>

## Marital Status by Age Group

The heatmap below displays the fraction of people for each age cohort who are (1) never married, (2) currently married, (3) separated, (4) divorced, or (5) widowed.
It highlights how marital status shifts over the life course in the US, with “never married” dominating the late teens and early twenties, “married” climbing rapidly in the twenties and peaking in the mid-forties, and divorced and widowed becoming much more pronounced in later stages of life. All of these considerations are of relevance to informing life experiences and personas.

<center>
<img src="images/nemotron_personas_marital_status_distribution.png" width="600px">
</center>

## Education Level by Age Group

The heatmap below captures intricate patterns of educational attainment across age cohorts. For example, note how the share of high-school-only and no-diploma individuals ebbs then resurges among the oldest age groups, reflecting historical shifts in access and in social norms.

<center>
<img src="images/nemotron_personas_education_distribution.png" width="600px">
</center>

## Geographic Intricacies of Educational Attainment

This slice of our dataset demonstrates how geography informs education and therefore persona descriptions. The choropleth map shows, for each U.S. state, the share of residents age 25 and older who hold at least a bachelor’s degree. No LLM in our testing was able to generate data of this fidelity.

<center>
<img src="images/nemotron_personas_education_map.png" width="700px">
<p><em>Left: Nemotron-Personas-USA dataset. Right: <a href="https://en.wikipedia.org/wiki/Educational_attainment_in_the_United_States">Educational attainment in the United States, Wikipedia</a></em></p>
</center>

## Occupational Categories

The treemap below reflects the richness of our dataset with respect to professional occupations of personas. Represented in our dataset are over 560 occupation categories that are further informed by demographic and geographic distributions.

<center>
<img src="images/nemotron_personas_occupation_tree_map.png" width="600px">
</center>

## Persona diversity

The attributes above (and many more) ultimately affect the diversity of the synthetic personas being generated. As an example, the analysis below highlights a multitude of clusters within professional persona descriptions. These clusters are identified by clustering embeddings and reducing dimensionality to 2D.

<center>
<img src="images/nemotron_personas_professional_personas_clustering.png" width="600px">
</center>

# How to use it

You can load the dataset with the following lines of code.

```python
from datasets import load_dataset

nemotron_personas = load_dataset("nvidia/Nemotron-Personas-USA")
```

# Dataset Characterization

## Data Collection Method

* Hybrid: Human, Synthetic, Automated

## Labeling Method

* Not Applicable

## Dataset Format

* Text

## Dataset Quantification

* Record counts: 1M records (6M persona descriptions)
* Total data storage: 2.6GB

# Ethical Considerations:

NVIDIA believes [Trustworthy AI](https://www.nvidia.com/en-us/ai-data-science/trustworthy-ai/) is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal teams to ensure this dataset meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

# Citation

If you find the data useful, please cite:

```
@software{nvidia/Nemotron-Personas-USA,
  author = {Meyer, Yev and Corneil, Dane},
  title = {{Nemotron-Personas-USA}: Synthetic Personas Aligned to Real-World Distributions},
  month = {June},
  year = {2025},
  url = {https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA}
}
```
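
The loading snippet under "How to use it" pulls the full dataset. As a hedged sketch of narrowing down to specific personas through the contextual fields, the helper below filters any iterable of record dictionaries by field values or predicates. The field names in the sample (`state`, `age`, `occupation`), the `train` split name, and the `stream_records` and `select` helpers are assumptions made for illustration. The card shows the schema only as an image, so verify the names against the actual data.

```python
from typing import Any, Dict, Iterable, Iterator

Record = Dict[str, Any]

# Hypothetical records used to exercise the filter offline; not real dataset rows.
SAMPLE = [
    {"state": "TX", "age": 67, "occupation": "registered nurse"},
    {"state": "VT", "age": 34, "occupation": "carpenter"},
    {"state": "TX", "age": 29, "occupation": "software developer"},
]


def stream_records(name: str = "nvidia/Nemotron-Personas-USA", split: str = "train") -> Iterable[Record]:
    """Stream records lazily instead of downloading the whole dataset first.

    The split name is an assumption; needs the `datasets` package and network access.
    """
    from datasets import load_dataset  # lazy import so `select` works without it

    return load_dataset(name, split=split, streaming=True)


def _matches(record: Record, field: str, want: Any) -> bool:
    """Check one criterion: a callable is a predicate, anything else is an exact match."""
    if field not in record:
        return False  # records lacking the field never match
    value = record[field]
    return bool(want(value)) if callable(want) else value == want


def select(records: Iterable[Record], **criteria: Any) -> Iterator[Record]:
    """Yield the records that satisfy every criterion, preserving input order."""
    for record in records:
        if all(_matches(record, field, want) for field, want in criteria.items()):
            yield record


if __name__ == "__main__":
    # Offline usage on the hypothetical sample.
    print(list(select(SAMPLE, state="TX", age=lambda a: a >= 65)))
```

Because `select` is a generator, it can be chained onto `stream_records()` and cut short with `itertools.islice`, so a handful of matching personas can be inspected without reading all 1M records.
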

Provider: maas
Created: 2025-10-29