Jeremydh911/Nemotron-Personas-USA
Source: Hugging Face. Updated 2026-04-22; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Jeremydh911/Nemotron-Personas-USA
Description:
Nemotron-Personas-USA is an open-source (CC BY 4.0) dataset of synthetically generated personas grounded in real-world demographic, geographic, and personality trait distributions to capture the diversity and richness of the population. It is the first dataset of its kind aligned with statistics for names, sex, age, background, marital status, education, occupation, and location, among other attributes. With an initial release focused on the United States, this dataset provides high-quality personas for a variety of modeling use cases. The dataset can be used to improve the diversity of synthetically generated data, mitigate data and model biases, and prevent model collapse. In particular, compared with past persona datasets, it is designed to be more representative of underlying demographic distributions along multiple axes, including age (e.g., older personas), geography (e.g., rural personas), education, occupation, and ethnicity. Produced using NVIDIA's NeMo Data Designer, an enterprise-grade compound AI system for synthetic data generation, the dataset leverages a proprietary Probabilistic Graphical Model (PGM) along with an Apache-2.0-licensed openai/gpt-oss-120b model and an ever-expanding set of validators and evaluators built into Data Designer. The dataset contains 1M records, each with 6 persona fields and 16 contextual fields, totaling ~936M tokens. The dataset is ready for commercial and non-commercial use.
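As a rough illustration of the "improve diversity of synthetically generated data" use case, the sketch below streams a few records and prefixes a generation task with persona attributes. It is a minimal sketch under assumptions, not the dataset's documented usage: the `train` split name and the field names passed by the caller (such as `persona` or `age`) are hypothetical and should be checked against the actual dataset card, and `build_prompt` and `stream_records` are helper names introduced here for illustration only.

```python
from typing import Iterable, Iterator, Mapping


def build_prompt(record: Mapping[str, object], task: str, fields: Iterable[str]) -> str:
    """Prefix a task with the selected attributes of one persona record.

    Fields that are absent from the record, or set to None, are skipped.
    """
    lines = [f"{name}: {record[name]}" for name in fields if record.get(name) is not None]
    profile = "\n".join(lines)
    return f"Adopt the following persona.\n{profile}\n\nTask: {task}"


def stream_records(repo_id: str, limit: int) -> Iterator[Mapping[str, object]]:
    """Yield up to `limit` records without downloading the full dataset.

    Assumption: the repository exposes a split named "train".
    """
    from datasets import load_dataset  # third-party: pip install datasets

    dataset = load_dataset(repo_id, split="train", streaming=True)
    for index, record in enumerate(dataset):
        if index >= limit:
            break
        yield record


if __name__ == "__main__":
    # Field names here are placeholders; replace them with the dataset's real columns.
    chosen_fields = ["persona", "age"]
    for row in stream_records("Jeremydh911/Nemotron-Personas-USA", limit=3):
        print(build_prompt(row, "Write a short product review.", chosen_fields))
        print("-" * 40)
```

Streaming avoids materializing all 1M records locally when only a sample of personas is needed; sampling many distinct personas, rather than reusing a handful, is what spreads the generated text across the demographic axes described above.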
Provider:
Jeremydh911



