bae0056/Nemotron-Personas-USA

Source: Hugging Face. Updated: 2026-04-29. Indexed: 2026-05-03.
Download link:
https://hf-mirror.com/datasets/bae0056/Nemotron-Personas-USA
Description:
Nemotron-Personas-USA is an open-source (CC BY 4.0) dataset of synthetically generated personas grounded in real-world demographic, geographic, and personality-trait distributions, intended to capture the diversity and richness of the U.S. population. It is the first dataset of its kind aligned with statistics for names, sex, age, background, marital status, education, occupation, and location, among other attributes. The dataset contains 1M records, each with 6 persona fields and 16 contextual fields, totaling ~936M tokens. Produced using NVIDIA's NeMo Data Designer, the dataset combines a proprietary Probabilistic Graphical Model (PGM) with the Apache-2.0-licensed openai/gpt-oss-120b model. It is designed to improve the diversity of synthetically generated data, mitigate data and model biases, and prevent model collapse.
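
The figures in the description can be sanity-checked with a little arithmetic, and the records can be previewed without downloading the whole dataset. The following is a minimal sketch, not an official loader: the helper names (`fields_per_record`, `avg_tokens_per_record`, `peek_records`) are hypothetical, the split name `"train"` is an assumption not confirmed by this page, and `peek_records` assumes the third-party Hugging Face `datasets` package is installed and the repository is reachable.

```python
from itertools import islice

# Figures quoted in the description above.
NUM_RECORDS = 1_000_000           # "1M records"
PERSONA_FIELDS = 6                # persona fields per record
CONTEXT_FIELDS = 16               # contextual fields per record
APPROX_TOTAL_TOKENS = 936_000_000 # "~936M tokens" (approximate)


def fields_per_record() -> int:
    """Total number of fields per record implied by the description."""
    return PERSONA_FIELDS + CONTEXT_FIELDS


def avg_tokens_per_record() -> float:
    """Rough average tokens per record implied by the quoted totals."""
    return APPROX_TOTAL_TOKENS / NUM_RECORDS


def peek_records(repo_id: str = "bae0056/Nemotron-Personas-USA",
                 split: str = "train",  # assumption: split name not stated on this page
                 n: int = 3) -> list:
    """Stream the first n records instead of downloading everything.

    Requires the third-party `datasets` package and network access.
    """
    from datasets import load_dataset  # imported lazily; optional dependency

    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    print("fields per record:", fields_per_record())
    print("approx. tokens per record:", avg_tokens_per_record())
```

Calling `peek_records()` returns a short list of dictionaries, which is a convenient way to inspect the actual column names before committing to a full download.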
Provider:
bae0056